Lecturer: Dr. C. M. Jones-Todd
With thanks to Prof. Chris Triggs, Dr Kathy Ruggiero, and Prof. James Russell who all freely gave me their previous iterations of the course and told me to do what I wanted with their notes. I did, and they may well find sections of this material hauntingly familiar! Credit also to all the other fantastic people who’ve shared materials or have made notes freely available online. I have endeavored to reference all materials, and more, where appropriate.
© 2024, Dr. C. M. Jones-Todd
Please read this page carefully as it contains important information about the course structure, delivery, and assessment.
This is a postgraduate course geared towards students of biology, ecology, and environmental science. Building on a strong foundation in quantitative biology, fundamental statistical methods and basic R programming, you will learn an array of advanced biostatistical methods for data analysis. Topics covered include data wrangling, methods for the analysis of designed experiments, regression analysis, including mixed effect models, and the analysis of multivariate data, including advanced supervised and unsupervised learning techniques.
The course builds on assumed knowledge of some fundamental statistical concepts. It is expected that you are comfortable with the statistical content covered in a typical undergraduate (bio)statistics course (e.g., linear regression, hypothesis testing etc.). In addition, we will use the programming language R (through RStudio) and you are expected to be familiar with data import, manipulation, and visualization. If you are unfamiliar with R it is expected that you will prepare accordingly before the semester begins. I recommend that you recap the content of BIOSCI220.
Before the course begins you are required to complete the pre-course tasks.
These tasks are designed for you to self-assess your R and statistics knowledge. If you fail to complete the coding task and score less than 70% on the quiz you will likely struggle with the pace and content of the course. This does not preclude you from taking the course, however, you should expect to invest time on top of what is expected for a postgraduate course into ensuring that you understand the material.
You are expected to attend all lectures and your attendance will be recorded.
There is no test or exam for this course.
Assessment summary:

| Assignments | R-tasks | Project |
|---|---|---|
Accessibility: If there is any portion of the course that is not accessible to you due to challenges with technology or the course format, please let me know so we can make appropriate accommodations.
TL;DR: Don’t cheat!
Please abide by the following as you work on assignments in this course:
The due dates for assignments are there to help you keep up with the course material and to ensure the teaching team can provide feedback in a timely manner. However, I understand that things come up periodically that could make it difficult to submit an assignment by the deadline. If this is the case for you it is imperative that you get in contact with me as soon as possible so that we might make alternative arrangements.
Key topics
This course is not designed as a course in R. Throughout we use R and RStudio as tools to carry out statistical inference, but it is assumed that you are already familiar with the basics of this software. You will be led through multiple examples in the chapters to come and it is expected that you invest time working through them, working out what each line of code does. It is inadvisable to simply copy & paste the examples without an idea of what is being asked. Figuring out the code for yourself will be a huge help when you come to modify it to suit your own purposes.
Rather than telling you how to carry out specific operations, this section of the course guide outlines practices and tools focusing on employing good scientific practice. We discuss the best practices we should employ when dealing with data, code, and our obligations in drawing inferences from our analyses. There are many ongoing discussions in the scientific community around ethical data practice and what it entails. It is a hugely important subject, and in many ways has a long way to go. We only touch briefly on some aspects of ethical data practice in this section, presenting the suggestions and thoughts of the relevant communities and organisations in those areas.
Accuracy and Honesty: Being accurate and honest in your analyses and conclusions.
Respectful Handling of Data: Recognize that data may represent people, beliefs, or behaviour, and be respectful of this.
Awareness of Consequences: Considering the ethical implications and societal impact of your research.
TASK Read Ethical Guidelines for Statistical Practice and outline which principle(s) you feel directly relate to your current career pathway.
R and RStudio

Throughout this course we will be programming using R. R is a free open-source statistical programming language created by Ross Ihaka (of Ngāti Kahungunu, Rangitāne and Ngāti Pākehā descent) and Robert Gentleman, here at UoA in the Department of Statistics! It is widely used across many scientific fields for data analysis, visualization, and statistical modeling. Proficiency in R will enable you to wrangle and explore datasets, conduct statistical analyses, and create visualizations to communicate your findings. These are all essential tools required in any scientific discipline.
TASK Go to the Statistics Department display in the science building foyer (building 303) to see a CD-ROM with the first ever version of R.
RStudio is an integrated development environment (IDE) for the R programming language. It serves as a user-friendly workspace, offering additional tools for coding, data visualization, and project management. RStudio simplifies the process of learning R by providing a structured interface with many built-in tools to help organize workflow, fostering a systematic approach to data analysis and research.
TASK Research the meaning of open-source software and briefly outline the pros and cons of this in the context of statistical analysis.
R terminology

| Term | Description |
|---|---|
| Script | A file containing a series of R commands and code that can be executed sequentially. |
| Source | To execute the entire content of an R script, often done using the “Source” button in RStudio. |
| Running Code | The process of executing R commands or scripts to perform specific tasks and obtain results. |
| Console | The interactive interface in RStudio where R commands can be entered and executed line by line. |
| Commenting | Adding comments to the code using the # symbol to provide explanations or annotations. Comments are ignored during code execution. |
| Assignment Operator | The symbol <- or = used to assign values to variables in R. |
| Variable | A named storage location for data in R, which can hold different types of values. |
| Data Type | The classification of data into different types, such as numeric, character, logical, etc. |
| Object | A data structure that holds a specific type of data. Objects are used to store and manipulate data, and they can take various forms depending on the type of information being represented. |
| Logical Operator | Symbols like ==, !=, <, >, <=, and >= used for logical comparisons in conditional statements. |
| Function | A block of reusable code that performs a specific task when called (e.g., mean(c(3, 4))). |
| Argument | The input values that are passed to the function when it is called. |
| Error Handling | The process of anticipating, detecting, and handling errors that may occur during code execution. |
| Debugging | The process of identifying and fixing errors or bugs in the code. |
| Workspace | The current working environment in R, which includes loaded data, variables, and functions. |
R syntax?

This is not a comprehensive list of R syntax, but a call to develop your own coding style and use whatever you are most comfortable with. Throughout this course I will likely use a mix of syntax and functions to carry out the same operations. This is (somewhat) by design, and is intended to expose you to a strength of R, which is that there are multiple ways of doing the same thing. As long as it works and is reproducible then whatever works for you, although of course there are some recommended practices, see Good coding practice.
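As a small illustration of this flexibility, the sketch below computes the same set of group means in two different ways using the built-in iris data (any dataset with a grouping variable would do):

```r
# Two equivalent base R ways to get the mean sepal length for each iris species
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

tapply(iris$Sepal.Length, iris$Species, mean)
#     setosa versicolor  virginica
#      5.006      5.936      6.588
```

A tidyverse equivalent (e.g., dplyr::summarise() after dplyr::group_by()) would give the same numbers; which you reach for is largely a matter of style.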
Example <- vs =
Both <- and = are assignment operators. It is mainly a personal choice which you use. However, there is a difference depending on the environment in which the assignment is evaluated. At the top level there is no difference; however, = does not act as an assignment operator in a function call, whereas <- does.
TASK Both lines of code below give you a two row matrix, can you work out what the difference is between them (HINT look at what is created in your environment after each line)?
matrix("BIOSCI78", nrow = 2)
matrix("BIOSCI78", nrow <- 2)
Example %>% vs |>
Recall the tidyverse (specifically the magrittr) pipe operator %>%, which allows us to combine multiple operations in R into a single sequential chain of actions. There is also a base pipe |> which acts similarly, but not identically. Essentially, |> passes the left-hand side as the first argument in the right-hand side call. This is subtly different from %>%, as |> cannot pass the left-hand side onto multiple arguments. The %>% operator, however, does allow for this and you can also change the placement of the left-hand object with a . placeholder.
TASK Given df <- data.frame(x = 1:10, group = rep(c("A", "B"))) are all the following operations equivalent?
split(df, df$group)
df %>% split(.$group)
df |> split(df$group)
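Separately from the task above, here is a minimal sketch of the placement rules; it assumes the magrittr package is installed:

```r
library(magrittr)  # provides %>%; assumed to be installed

# |> always passes the left-hand side as the FIRST argument
3.14159 |> round(digits = 2)       # 3.14

# %>% does the same by default
c(1, 5, 10) %>% seq_along()        # 1 2 3

# but with %>% the . placeholder can put the left-hand side anywhere
2 %>% round(3.14159, digits = .)   # 3.14: here 2 becomes the digits argument
```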
Sometimes, rather than doing what you expect it to, R will return an Error message to you via the Console, prefaced with Error in… followed by text that tries to explain what went wrong. This, generally, means something has gone wrong, so what do you do?
Interpreting R errors is an invaluable skill! The error messages are designed to be clear and informative. They aim to tell you what might be going wrong in the code. These messages typically contain information about the type of error, hinting at how you might fix it. For example, if there is a typo in a function or variable name, R will produce an error like "Error: object 'variable_name' not found". This suggests that R couldn’t find the specified object variable_name.
Understanding error messages involves paying attention to the details, such as the error type (e.g., syntax error, object not found, etc.), and the specific line number where the error occurred! You should always run your code line-by-line, especially if you are new to programming. This means that each line of code is executed in isolation, providing immediate feedback and pinpointing the exact location where an error occurs. If the meaning or solution to an error isn’t immediately obvious then make use of the documentation (RTFM) and even technology. Online platforms like Stack Overflow or RStudio Community and tools like ChatGPT can often be your friend! It is very unlikely that you are the first to have encountered the error, and some kind soul will have posted a solution online (which the AI bot will have scraped…). However, it’s crucial to approach online answers carefully. You need to first understand the context of your error and use your knowledge about your issue to figure out if the solution is applicable (this gets easier with experience).
It is always best to try debugging yourself before blindly following a stranger’s advice. Debugging is a crucial aspect of R programming and the process itself helps solidify your understanding. There are several widely used methods that help identify and resolve issues in your code, two are outlined below.
For example, given "Error: object 'variable_name' not found", I might ask R to print out what objects do exist in my workspace; it could be that a previous typo means that the object variable_nam exists. The table below summarises some common R issues and errors and their solutions.
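As a concrete sketch of that workspace check (the object names here are purely illustrative):

```r
# A typo while assigning means the intended object never gets created
variable_nam <- c(2, 4, 6)   # oops: meant to type variable_name

exists("variable_name")      # FALSE: R has no object with this name
ls()                         # lists what DOES exist; "variable_nam" reveals the typo
mean(variable_nam)           # 4: the data are fine, only the name is wrong
```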
| Error/Issue | Description | Solution |
|---|---|---|
| Syntax Error | Incorrect use of R syntax, such as missing or mismatched parentheses, braces, or quotes. | Carefully review the code, check for missing or mismatched symbols, and ensure proper syntax is used. |
| Object Not Found | Attempting to use a variable or function that hasn’t been defined or loaded into the workspace. | Check for typos in variable or function names, ensure the object is created or loaded, and use correct names. |
| Package Not Installed/Loaded | Trying to use a function from a package that is not installed or loaded into the environment. | Install the required package using install.packages("package_name"), and load it using library(package_name). |
| Undefined Function/Variable | Using a function or variable that hasn’t been defined or is out of scope. | Define the function or variable, or check its scope within the code. |
| Data Type Mismatch | Operating on data of incompatible types, such as performing arithmetic on character data. | Ensure that data types are compatible, and consider using functions like as.numeric() or as.character() for conversions. |
| Misuse of Assignment Operator | Using the wrong assignment operator (= instead of <- or vice versa). | Be consistent with the assignment operator, and use <- for assignment in most cases. |
| Missing Data | Dealing with missing values in the data, leading to errors in calculations or visualizations. | Handle missing data appropriately, using functions like na.omit() or complete.cases(). |
| Index Out of Bounds | Attempting to access an element in a vector or data frame using an index that is out of range. | Check the length of the vector or data frame and ensure the index is within bounds. |
| Failure to Load a File | Issues with loading a data file using functions like read.csv() or read.table(). | Check the file path, file format, and encoding. Confirm that the file exists in the specified location. |
| Incorrect Function Arguments | Providing incorrect arguments or parameters to a function, leading to errors. | Refer to the function’s documentation to understand the correct parameters and ensure proper usage. |
Sometimes your computer will return a warning message to you prefaced “Warning:”. These can sometimes be ignored as they may not affect us. However, READ THE MESSAGE and decide for yourself. Occasionally your computer will also write you a friendly message, just keeping you up to date with what it’s doing. Again, don’t ignore these; they might be telling you something useful!
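If you want your own code to anticipate errors and warnings rather than fall over, R’s tryCatch() is the standard tool. Below is a minimal sketch; safe_log() is a made-up wrapper for illustration:

```r
# tryCatch() lets you decide what happens when an error or warning occurs
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) {  # e.g., log(-1) warns "NaNs produced"
      message("caught a warning: ", conditionMessage(w))
      NA_real_
    },
    error = function(e) {    # e.g., log("a") errors: non-numeric argument
      message("caught an error: ", conditionMessage(e))
      NA_real_
    }
  )
}

safe_log(100)   # 4.60517: no condition raised, the value is returned as usual
safe_log(-1)    # NA, plus a message describing the warning
safe_log("a")   # NA, plus a message describing the error
```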
TASK Keep a bug diary! Each time you get an error message, see if you can solve it and keep a diary of your attempts and solution.
Honesty is an expectation: we expect honesty from you and you expect the same from your teaching team. Honesty is an expectation in any scientific discipline, as is accuracy. These are morals, ethical principles we should abide by. But this course isn’t here to discuss philosophy or character development. This course, in particular this section of the course, aims to expose you to the tools and principles that will aid you in your own pursuit of ethical data practice. Teaching you the tools so that your analysis is reproducible goes some way towards ensuring accuracy in your research. This is because reproducibility promotes transparency, facilitates error detection and correction, and contributes to the overall reliability and accuracy of your research findings.
“Reproducibility, also known as replicability and repeatability, is a major principle underpinning the scientific method. For the findings of a study to be reproducible means that results obtained by an experiment or an observational study or in a statistical analysis of a data set should be achieved again with a high degree of reliability when the study is replicated. … With a narrower scope, reproducibility has been introduced in computational sciences: Any results should be documented by making all data and code available in such a way that the computations can be executed again with identical results.”
Reproducibility is a stepping stone towards ensuring accuracy. Establishing good practice when dealing with data and code right from the beginning is essential. Good practice 1) ensures that data is collected, processed, and stored accurately and consistently, which helps maintain the quality and integrity of the data throughout its lifecycle; and 2) creates a robust code base, which can be easily understood and adapted as the project progresses, leading to faster development.
You should always start with a clean workspace. Why? So your ex (code) can’t come and mess up your life!
To ensure that RStudio does not load up your previous workspace go to Tools > Global Options and uncheck the highlighted options below.
The reasoning may not be immediately obvious; however, you will later regret it if you don’t start as you mean to go on! Loading up a previous workspace may seem convenient, as your previous objects and code are immediately on hand. However, this is the exact reason that it is not good practice: loading up a previous workspace is NOT reproducible, does NOT create a fresh R process, makes your script vulnerable, and it will come back to bite you.
TASK Below are two quotes from Jenny Bryan, an R wizard which reference two snippets of R code. Find out what each snippet does and why Jenny is so against them.
If the first line of your R script is
setwd("C:\Users\jenny\path\that\only\I\have")
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
If the first line of your R script is
rm(list = ls())
I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
A project-oriented workflow in R refers to a structured approach to organizing and managing your code, data, and analyses. This helps improve reproducibility and the overall efficiency of your work. Within this it is essential to write code that is easy to understand, maintain, and share. To do so, coding best practice is to follow the 5 Cs by being
- … (e.g., testthat) to verify that your functions work as intended.
- … use the usethis package to help set up your package structure in a conformant way.

TASK Read Good enough practices in scientific computing and briefly summarise why good coding practices are key to any scientific discipline.
There are many other good practice tips when it comes to coding these include ensuring your code is modular, implementing unit testing, automating workflows and implementing version control. In this course you will be using git to manage your project code and data. Use of git, or similar, will very likely be an expectation of your future career, see Version control with git and GitHub for an introduction to these tools.
| Item | What | Why |
|---|---|---|
| Be nice to your computer | No white spaces as some systems can be confused by them | Ensures compatibility across different systems and prevents errors |
| | No special characters (e.g., *, ^, +, …) | Prevents interpretation issues and potential conflicts with system functions |
| | Don’t assume case is meaningful | Avoids confusion on systems that do not differentiate by case |
| | Consistency is KEY | Avoids confusion in general |
| Be nice to yourself and your collaborators | Concise and descriptive names | Just makes life easier :) |
| Make sorting and searching easy | Dates should follow YYYY-MM-DD format | Standardized date representation makes sorting and interpretation easy |
| | Use numbers as a prefix to order files | Enables sequential sorting of files regardless of system |
| | Left pad with 0 so numbers have the same length | Ensures proper numerical order when sorting files |
| | Use keywords | You can search these! |
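The date and padding rules in the table can be automated when you write files from R. A small sketch (the file name itself is invented for illustration):

```r
# ISO dates (YYYY-MM-DD) and zero-padded counters give names that sort correctly
today <- as.Date("2024-03-07")
run   <- 3

sprintf("%s_run%02d_kiwi-counts.csv", format(today), run)
# "2024-03-07_run03_kiwi-counts.csv"

# left-padding with 0 keeps alphabetical order equal to numerical order
sort(sprintf("run%02d", c(2, 10, 1)))   # "run01" "run02" "run10"
```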
TASK Work through Danielle Navarro’s presentation on Project Structure and see how many pitfalls you have fallen into to-date.
git and GitHub

Git is a version control system that manages the evolution of a set of files, called a repository (repo), in a structured way (think of Word’s Track Changes). With git you can track the changes you make to your project/code. You will always have a record of what you’ve worked on and can easily revert back to an older version if you make a mistake. GitHub is a hosting service that provides a home for your git-based projects on the internet (think of Dropbox). In addition, GitHub offers functionality to use git online via an easy-to-use interface. Both git and GitHub can very easily be configured to work with RStudio.
Below are some key terms you will undoubtedly hear when delving into the git–GitHub world.
Repository (already mentioned) This is where the work happens; think of it as your project folder. It should contain all of your project’s files etc.
Cloning A repository on GitHub is stored remotely in the cloud. To create a local copy of this repository you can clone it and use Git to sync the two.
Committing and pushing are how you add the changes you made on your local machine to the remote repository on GitHub. You can make a commit when you have made milestone-worthy changes to your project. You should also add a helpful commit message to remind future you, or your teammates, what changes you made (e.g., fixed the bug in my_function).
The table below gives Ten Simple Rules for Taking Advantage of Git and GitHub, which outline how to get the most out of using these tools.
| Rule | Title | Why |
|---|---|---|
| Rule 1 | Use GitHub to Track Your Projects | Keeps your work backed up and in order |
| Rule 2 | GitHub for Single Users, Teams, and Organizations | Flexible repo rights help project management |
| Rule 3 | Developing and Collaborating on New Features: Branching and Forking | Easily copy and create your own version of a project to modify |
| Rule 4 | Naming Branches and Commits: Tags and Semantic Versions | Consistency helps your collaborators and end-users |
| Rule 5 | Let GitHub Do Some Tasks for You: Integrate | Continuous integration helps make sure your code is ready to go as soon as possible |
| Rule 6 | Let GitHub Do More Tasks for You: Automate | Automating tasks means less manual work and more reliable testing |
| Rule 7 | Use GitHub to Openly and Collaboratively Discuss, Address, and Close Issues | Promotes collaboration |
| Rule 8 | Make Your Code Easily Citable, and Cite Source Code! | Just overall good practice and aids reproducibility |
| Rule 9 | Promote and Discuss Your Projects: Web Page and More | Promoting your work is likely always good for your career |
| Rule 10 | Use GitHub to Be Social: Follow and Watch | GitHub is a nice way to follow developments in your field and see others’ work |
TASK Read Ten Simple Rules for Taking Advantage of Git and GitHub and briefly expand on the points above using examples from your studies/careers to-date.
R

R.version.string
## [1] "R version 4.3.2 (2023-10-31)"
Update RStudio to the new preview version (optional).

Cloning from GitHub using RStudio

- On GitHub, navigate to the Code tab of the repository and, on the right side of the screen, click Clone or download.
- Click the Copy to clipboard icon to the right of the repository URL.
- Open RStudio in your local environment and choose File, New Project, Version Control, Git.
- Paste the repository URL and give the project a name in the Project directory name field.
- Click Create Project. Your Files pane should now look similar to this, showing the repository’s files (e.g., README.md). Note that the Git pane (top right) is empty.
- Make and save a change to a file, and note that the Git pane (top right) is not empty:
Check this file in the Git tab (it is now staged for commit).
Click the Commit button. A new pane will open. Changes made to the file will be highlighted (additions in green and deletions in red). Now write yourself an informative message in the top right of this pop-up:
Now you’ll see RStudio has left you a little message in the Git tab, something similar to Your branch is ahead of origin/master by 1 commit. This means that you’ve made and committed your changes locally (i.e., on your computer) but you are yet to push these changes to GitHub.
To push to GitHub press the Push button.
A new pop up will let you know how things are going! You can then close this once it gives you the option to.
As with any new software there are likely to be a few teething issues! Below are some previous classmates’ solutions to issues encountered when committing and pushing changes using git via RStudio. These initial issues are likely due to authentication.
The config file

Assuming your current working directory is at the top level of your repo/project, choose the Files tab in your plotting pane, then open the .git folder and open the file named config. Edit config and add the following, replacing the relevant bits with your GitHub username/handle and email.

[user]
	name = <your GitHub username here>
	email = <your GitHub email here>
Using your GitHub email and password as authentication was decommissioned a while ago. Instead, a Personal Access Token is required.
Sign in to GitHub and follow these instructions to create a Personal Access Token.
Once completed, a new popup will appear with what looks like a random jumble of letters and numbers (in the format ghp_*************); this is your Personal Access Token. Copy this! Or put it in your notes, or password manager, for later.
Return to RStudio and clone your desired repository (e.g., the one for the assignment):
You will now be asked for your GitHub details:
Note, depending on the choices you made when creating your Personal Access Token, it may expire (the default duration is 30 days) and your authentication will stop working. No worries, simply create a new one and use this as your password!
The term data sovereignty has a hugely broad range of connotations. We do not aim to cover them all in this course. Typically, data sovereignty is focused on understanding and adhering to the legal and ethical considerations associated with data collection, storage, and analysis in different jurisdictions. Researchers are expected to follow and abide by the best ethical and legal practice whilst respecting individuals’ privacy and rights.
“For Indigenous peoples, historical encounters with statistics have been fraught, and none more so than when involving official data produced as part of colonial attempts at statecraft.”
Indigenous data sovereignty refers to the right or interest indigenous peoples and nations have to govern the collection, ownership, and application of their own data. As we are in Aotearoa this is most pertinent in terms of Māori data sovereignty.
“Māori Data Sovereignty has emerged as a critical policy issue as Aotearoa New Zealand develops world-leading linked administrative data resources.”
Te Tiriti o Waitangi/Treaty of Waitangi obliges the Government to actively protect taonga, consult with Māori in respect of taonga, give effect to the principle of partnership and recognize Māori rangatiratanga over taonga.
TASK There are many ongoing discussions that surround the Te Tiriti o Waitangi/Treaty of Waitangi, focused mainly on whether the statements of principles it outlines have been upheld. What do you think? Can you find a recent event/news article that relates to your studies, which either clearly upholds, or not, what you deem to be the correct ethical practice here?
Māori Data Sovereignty principles inform the recognition of Māori rights and interests in data, and promote the ethical use of data to enhance Māori wellbeing.
The Te Mana o te Raraunga Model was developed to align Māori concepts with data rights and interests, and guide agencies in the appropriate use of Māori data. Below are the guiding principles outlined by the Te Mana o te Raraunga Model in the Principles of Māori Data Sovereignty.
Rangatiratanga (authority)
Whakapapa (relationships)
Whanaungatanga (obligations)
Kotahitanga (collective benefit)
Manaakitanga (reciprocity)
Kaitiakitanga (guardianship)
Considering the implications and societal impact of your research includes ensuring that any conclusions you draw are appropriately and accurately balanced. Consider the previous chapter Data Visualization, the guiding principles of making informed visualizations included not misleading readers and prioritizing conveying a clear message. These speak to being mindful of how your figures may be perceived and presenting your data ethically and responsibly.
Your responsibilities go beyond just making figures; they extend to the methods and inferences you draw. Learning how to communicate science is a key and invaluable skill. Siouxsie Wiles is an award-winning science communicator, perhaps best known for stepping up during the pandemic, giving us information about the virus and advice on how to beat it.
“I assumed it would be through my research by helping develop a new antibiotic. But through the pandemic, I’ve learned that I can have a huge impact globally by doing good science communication.”
Below is a case study in science (mis)communication.
Asthma carbon footprint ‘as big as eating meat’ is the headline of this news story published on the BBC website. The article is based on research outlined in this published paper, which in turn cites this paper for an estimate of the carbon footprint reduced by an individual not eating meat.
It is not unreasonable to assume that many people would interpret this to mean that the total global carbon footprint due to eating meat is equal to the total carbon footprint due to the use of asthma inhalers. However, this is not what they mean. They mean that an individual deciding not to eat meat reduces their carbon footprint as much as an asthmatic individual deciding not to use an inhaler.
There are far more meat consumers compared to inhaler users and so the overall carbon footprint associated with meat consumption is much greater. However, the claim that not eating meat reduces someone’s carbon footprint about the same amount as not using inhalers is questionable. Yet, both the BBC article and the paper make this claim.
“And at the individual level, each metered-dose inhaler replaced by a dry powder inhaler could save the equivalent of between 150kg and 400kg (63 stone) of carbon dioxide a year - similar to the carbon footprint reduction of cutting meat from your diet.”
“Changing one MDI device to a DPI could save 150–400 kg CO2 annually; roughly equivalent to installing wall insulation at home, recycling or cutting out meat.”
Now, the carbon footprint of eating meat is estimated as 300–1600 kg CO2 annually by this paper (see Table 1), and so the two claims don’t really match up. Moreover, what is being suggested by the article? That asthmatics should think about ceasing their medication in the same way many people are trying to reduce meat consumption?!
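A quick back-of-the-envelope check of the two quoted ranges (the numbers are taken directly from the passages above):

```r
# Annual CO2 savings (kg) quoted above
inhaler_switch <- c(low = 150, high = 400)    # replacing one MDI with a DPI
no_meat        <- c(low = 300, high = 1600)   # cutting meat from your diet

no_meat / inhaler_switch   # 2 at the low end, 4 at the high end
```

The ranges only just overlap: the very top of the inhaler range (400 kg) exceeds the very bottom of the meat range (300 kg), but comparing like ends the meat saving is two to four times larger, so "roughly equivalent" is a stretch.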
In this section we’ve discussed how ethical data practice involves accuracy, respect, and clear communication. There is one other component that should be considered here and that is consequence. The two options in this case study are not balanced because they have very different consequences:
TASK Watch this lecture Algorithmic fairness: Examples from predictive models for criminal justice and summarise the key points made. Can you think of a recent story that highlights the issues raised?
The guiding principles of making informed visualizations included not misleading readers and prioritizing conveying a clear message. These speak to being mindful of how your figures may be perceived and presenting your data ethically and responsibly.
“Scientific visualization is classically defined as the process of graphically displaying scientific data. However, this process is far from direct or automatic. There are so many different ways to represent the same data: scatter plots, linear plots, bar plots, and pie charts, to name just a few. Furthermore, the same data, using the same type of plot, may be perceived very differently depending on who is looking at the figure. A more accurate definition for scientific visualization would be a graphical interface between people and data.”
Exploratory plots are just for you; they focus solely on data exploration. They
“…have obligations in that we have a great deal of power over how people ultimately make use of data, both in the patterns they see and the conclusions they draw.”
Explanatory plots are mainly for others. These are the most common kind of graph used in scientific publications. They should
The table below summarises Ten Simple Rules for Better Figures, a basic set of rules to improve your visualizations.
| Rule Name | Rule Description |
|---|---|
| Know Your Audience | Understand the characteristics and expertise of your audience to tailor the figure accordingly. |
| Identify Your Message | Clearly define the main message or takeaway that you want the audience to understand from the figure. |
| Adapt the Figure to the Support Medium | Tailor the figure’s complexity and design to suit the medium it will be presented in (e.g., print, online). |
| Captions Are Not Optional | Craft informative captions that provide essential details and context for interpreting the figure. |
| Do Not Trust the Defaults | Adjust default settings to optimize the visual elements of the figure, such as colors, scales, and labels. |
| Use Color Effectively | Apply color purposefully, taking into account accessibility considerations and cultural interpretations. |
| Do Not Mislead the Reader | Avoid creating misleading visualizations; there are even formulas to measure how misleading a graph is! |
| Avoid Chartjunk | Eliminate unnecessary decorations and embellishments in the figure that do not contribute to the message. |
| Message Trumps Beauty | Prioritize conveying a clear message over making the figure aesthetically pleasing. |
| Get the Right Tool | Choose the appropriate visualization tool (e.g., R) or chart type that best communicates the data and intended message. |
TASK Read this short paper and find examples in your choice of literature of one or more of the rules in action.
Key topics
Statistical inference is the process of deducing properties of an underlying distribution by analysis of data. The word inference means conclusions or decisions: statistical inference is about drawing conclusions and making decisions based on observed data.

Data, or observations, typically arise from some underlying process. It is the underlying process we are interested in, not the observations themselves. Sometimes we call the underlying process the population or mechanism of interest. The data are only a sample from this population or mechanism. We cannot possibly observe every outcome of the process, so we have to make do with the sample that we have observed. The data give us imperfect insight into the population of interest.

The role of statistical inference is to use this imperfect data to draw conclusions about the population of interest, while simultaneously giving an honest reflection of the uncertainty in our conclusions. Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution.
TASK Imagine that you have a fair coin and toss it 10 times. Write out all the possible outcomes alongside how likely you think each is.
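Once you have written down your guesses, a quick sketch in R can check them. This is just one way to do it, using the fact that the number of heads in 10 independent tosses of a fair coin follows a Binomial(10, 0.5) distribution (the variable names below are purely for illustration):

```r
## number of heads in 10 tosses of a fair coin ~ Binomial(10, 0.5)
n_tosses <- 10
heads <- 0:n_tosses
## probability of each possible number of heads
probability <- dbinom(heads, size = n_tosses, prob = 0.5)
data.frame(heads, probability = round(probability, 4))
## the probabilities over all possible outcomes sum to one
sum(probability)
```

Note that five heads is the single most likely outcome, yet it still happens less than a quarter of the time.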
A permutation test is a statistical significance test where the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data points. This is also known as an exact test.
Now, often finding all possible combinations is hugely computationally expensive so we harness the power of simulation and carry out a randomisation test, which randomly simulates (for a given number of repetitions) possible values of the test statistic under the null hypothesis to obtain its approximate distribution.
The basic approach to a randomisation test is straightforward:
TASK Study this cheatsheet and link the relevant sections to each step given above.
Below are data specifying mandible lengths (mm) for golden jackals (Canis aureus) of each sex from the British Museum.
| Mandible length (mm) | Biological sex |
|---|---|
| 120 | Male |
| 107 | Male |
| 110 | Male |
| 116 | Male |
| 114 | Male |
| 111 | Male |
| 113 | Male |
| 117 | Male |
| 114 | Male |
| 112 | Male |
| 110 | Female |
| 111 | Female |
| 107 | Female |
| 108 | Female |
| 110 | Female |
| 105 | Female |
| 107 | Female |
| 106 | Female |
| 111 | Female |
| 111 | Female |
Scientific question: Are the jaw lengths of jackals the same, on average, in both sexes?
Null hypothesis: The average jaw length in male jackals is the same as in females
Test statistic: Difference of sample means
Let us first calculate the observed test statistic:
library(tidyverse)
## the data
jackal <- data.frame(mandible_length_mm = c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112,
110, 111, 107, 108, 110, 105, 107, 106, 111, 111),
sex = rep(c("Male","Female"), each = 10))
## observed statistic
jackal_mean_diff <- jackal %>%
group_by(sex) %>%
summarise(mean = mean(mandible_length_mm)) %>%
summarise(diff = diff(mean)) %>%
as.numeric()
jackal_mean_diff
## [1] 4.8
Now we use the command combn(x, m) to generate all possible combinations of the elements of x taken m at a time. For our data we have 20 elements in total, with two groups of 10 elements each; therefore, we use combn(20, 10). From these combinations we calculate the proportion of times the test statistic computed under the null hypothesis is at least as extreme as the one observed (the p-value):
combinations <- combn(20,10)
## Do the permutations
permtest_combinations <- apply(combinations, 2, function(x)
mean(jackal$mandible_length_mm[x]) - mean(jackal$mandible_length_mm[-x]))
## Full Permutation test p.value
p_val <- length(permtest_combinations[abs(permtest_combinations) >= jackal_mean_diff]) / choose(20,10)
p_val
## [1] 0.003334127
Rather than considering all possible combinations we might instead use 99 random permutations (plus the observed statistic, making 100) under the null hypothesis. Here, we sample without replacement:
## set up matrix
random_perm <- apply(matrix(0, nrow = 99, ncol = 1), 1, function(x) sample(20))
random_mean_diff <- apply(random_perm, 2, function(x){
z <- jackal$mandible_length_mm[x]
mean(z[jackal$sex == "Male"]) - mean(z[jackal$sex == "Female"])
})
## add the observed
random_mean_diff <- c(random_mean_diff, jackal_mean_diff)
random_p.value <- length(random_mean_diff[abs(random_mean_diff) >= jackal_mean_diff]) / 100 ## note the abs()
random_p.value
## [1] 0.02
How does this compare to the exact (permutation) results above? Try increasing the number of times you randomise. What happens as you increase it?
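As a sketch of that experiment (assuming the jackal data and observed statistic from above; the choice of 9999 repetitions is arbitrary), base R's replicate() keeps the code compact:

```r
## the jackal data and observed statistic, as above
jackal <- data.frame(
  mandible_length_mm = c(120, 107, 110, 116, 114, 111, 113, 117, 114, 112,
                         110, 111, 107, 108, 110, 105, 107, 106, 111, 111),
  sex = rep(c("Male", "Female"), each = 10))
obs <- mean(jackal$mandible_length_mm[jackal$sex == "Male"]) -
  mean(jackal$mandible_length_mm[jackal$sex == "Female"])
## 9999 random reallocations of the 20 lengths into two groups of 10
nreps <- 9999
perm_diffs <- replicate(nreps, {
  shuffled <- sample(jackal$mandible_length_mm)
  mean(shuffled[1:10]) - mean(shuffled[11:20])
})
## approximate p-value, including the observed statistic
sum(abs(c(perm_diffs, obs)) >= abs(obs)) / (nreps + 1)
```

As the number of randomisations grows, this approximate p-value settles down around the exact value (0.0033…) from the full permutation test above.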
Remember the Pāua data from the previous module? The dataset contains the following variables
Age of P\(\overline{\text{a}}\)ua in years (calculated from counting rings in the cone)
Length of P\(\overline{\text{a}}\)ua shell in centimeters
Species of P\(\overline{\text{a}}\)ua: Haliotis iris (typically found in NZ) and Haliotis australis (less commonly found in NZ)
One question we may want to ask is whether, on average, the shell length differs between Species.
library(tidyverse)
paua <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/paua.csv")
Scientific question: Are the shell lengths the same, on average, in both species?
Null hypothesis: The distribution of shell lengths in Haliotis iris is the same as in Haliotis australis
Test statistic: Difference of sample means
But because the data are skewed, and we’ve likely got non-constant variances, we may be better off adopting a randomisation test (this time using a for loop!) rather than a parametric t-test:
## observed differences in means
diff_in_means <- (paua %>% group_by(Species) %>%
summarise(mean = mean(Length)) %>%
summarise(diff = diff(mean)))$diff
diff_in_means
## [1] -0.9569444
## Number of times I want to randomise
nreps <- 999
## initialize empty array to hold results
randomisation_difference_mean <- numeric(nreps)
set.seed(1234) ## *****Remove this line for actual analyses*****
## This means that each run with produce the same results and
## agree with the printout that I show.
for (i in 1:nreps) {
## the observations
data <- data.frame(value = paua$Length)
## randomise labels
data$random_labels <- sample(paua$Species, replace = FALSE)
## randomised differences in mean
randomisation_difference_mean[i] <- data %>%
group_by(random_labels) %>% summarise(mean = mean(value)) %>%
summarise(diff = diff(mean)) %>%
as.numeric()
}
## results, join randomised stats with the observed value
results <- data.frame(results = c(randomisation_difference_mean, diff_in_means))
Let’s calculate how many randomised differences in means are at least as extreme as the one we observed. Note that we take the absolute value as we are dealing with a two-tailed test.
n_exceed <- sum(abs(results$results) >= abs(diff_in_means))
n_exceed
## [1] 2
## proportion
n_exceed/nreps
## [1] 0.002002002
What do you conclude from this proportion? How does this tie in with the distribution of the test statistic under the null hypothesis shown below?
NOTE: We can extend the randomization test to make inference about any sample statistic (not just the mean)
A sampling distribution shows us what would happen if we took very many samples under the same conditions. The bootstrap is a procedure for finding the (approximate) sampling distribution from just one sample.
The original sample represents the distribution of the population from which it was drawn. Resamples, taken with replacement from the original sample are representative of what we would get from drawing many samples from the population (the distribution of the statistics calculated from each resample is known as the bootstrap distribution of the statistic). The bootstrap distribution of a statistic represents that statistic’s sampling distribution.
TASK Study this cheatsheet and link the relevant sections to each step given above.
Old Faithful is a geyser located in Yellowstone National Park, Wyoming. Below is a histogram of the durations of 299 consecutive eruptions. Clearly the distribution is bimodal (has two modes)!
library(tidyverse)
ggplot(data = MASS::geyser, aes(x = duration)) +
geom_histogram() +
xlab("Duration of eruptions (m)") +
theme_linedraw()
Step 1: Calculating the observed mean eruption duration time:
mean <- MASS::geyser %>%
summarise(mean = mean(duration))
mean
## mean
## 1 3.460814
Step 2: Construct bootstrap distribution
## Number of times I want to bootstrap
nreps <- 1000
## initialize empty array to hold results
bootstrap_means <- numeric(nreps)
set.seed(1234) ## *****Remove this line for actual analyses*****
## This means that each run with produce the same results and
## agree with the printout that I show.
for (i in 1:nreps) {
## bootstrap. note with replacement
bootstrap_sample <- sample(MASS::geyser$duration, replace = TRUE)
## bootstrapped mean resample
bootstrap_means[i] <- mean(bootstrap_sample)
}
Step 3: Collate the sampling distribution
## results
results <- data.frame(bootstrap_means = bootstrap_means)
ggplot(data = results, aes(x = bootstrap_means)) +
geom_histogram() +
geom_vline(xintercept = as.numeric(mean)) +
ggtitle("Sampling distribution of the mean") +
xlab("Bootstrap means") + ylab("") + theme_classic()
Step 4: Calculate quantities of interest from the sampling distribution
bias <- as.numeric(mean) - mean(results$bootstrap_means)
bias
## [1] 0.001200111
sd(results$bootstrap_means)
## [1] 0.06740607
## compare to SEM of original data
MASS::geyser %>%
summarise(sem = sd(duration)/sqrt(length(duration)))
## sem
## 1 0.06638498
\[\text{statistic} \pm t^* \text{SE}_\text{bootstrap}\] where \(t^*\) is the critical value of the \(t_{n-1}\) distribution with area \(C\) between \(-t^*\) and \(t^*\).
For \(C = 0.95\):
as.numeric(mean) + c(-1,1) * qt(0.975,298)*sd(results$bootstrap_means)
## [1] 3.328162 3.593466
So our 95% confidence interval is 3.3 to 3.6. Alternatively, the bootstrap percentile interval takes (approximately) the 2.5% and 97.5% quantiles of the 1000 sorted bootstrap means:
sort(results$bootstrap_means)[c(25,975)]
## [1] 3.328428 3.591081
Resampling procedures: the differences
Resampling methods are a variety of methods that do one of the following:
Permutation vs bootstrap test
The permutation test exploits symmetry under the null hypothesis.
A full permutation test p-value is exact, conditional on data values in the combined sample.
A bootstrap estimates the probability mechanism that generated the samples under the null hypothesis.
A bootstrap does not rely on any special symmetry assumptions or exchangeability.
“Type I Zoom error: you think people can hear you, but you’re actually on mute. Type II Zoom error: you think you’re muted, but actually people can hear you.”
A Type I error (false positive): declare a difference (i.e., reject \(H_0\)) when there is no difference (i.e., \(H_0\) is true). The risk of a Type I error is determined by the level of significance (which we set!), i.e., \(\alpha =\text{ P(Type I error)} = \text{P(false positive)}\).
A Type II error (false negative): difference not declared (i.e., \(H_0\) not rejected) when there is a difference (i.e., \(H_0\) is false). Let \(\beta =\) P(do not reject \(H_0\) when \(H_0\) is false); so, \(1-\beta\) = P(reject \(H_0\) when \(H_0\) is false) = P(a true positive), which is the statistical power of the test.
Each time we carry out a hypothesis test the probability we get a false positive result (type I error) is given by \(\alpha\) (the level of significance we choose). The significance level is the probability of a Type I error (i.e., the probability of finding an effect that is not there, a false positive). The power of a test is the probability that the test correctly rejects the null hypothesis when the alternative hypothesis is true. The probability of finding an effect that is there = 1 - probability of a Type II error (false negative).
Reducing the chance of a Type I error increases the chance of a Type II error. They are inversely related. Type II error rate is determined by a combination of the following.
When we have multiple comparisons to make we should then control the Type I error rate across the entire family of tests under consideration, i.e., control the Family-Wise Error Rate (FWER); this ensures that the risk of making at least one Type I error among the family of comparisons in the experiment is \(\alpha\). More on this in later chapters.
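As a minimal sketch of what controlling the FWER looks like in practice, R’s built-in p.adjust() implements the standard corrections (the raw p-values below are made up purely for illustration):

```r
## hypothetical raw p-values from a family of five comparisons
raw_p <- c(0.001, 0.012, 0.030, 0.045, 0.200)
## Bonferroni: multiply each p-value by the number of tests (capped at 1)
p.adjust(raw_p, method = "bonferroni")
## Holm's step-down method also controls the FWER and is uniformly more powerful
p.adjust(raw_p, method = "holm")
```

Comparing each adjusted p-value (rather than each raw p-value) to \(\alpha\) keeps the chance of at least one false positive across the whole family at \(\alpha\).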
| State of Nature | Don’t reject \(H_0\) | Reject \(H_0\) |
|---|---|---|
| \(H_0\) is true | Correct decision | Type I error |
| \(H_0\) is false | Type II error | Correct decision |
“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”
Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g., the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
There are many different schools of thought about how a p-value should be interpreted.
Most people agree that the p-value is a useful measure of the strength of evidence against the null hypothesis. The smaller the p-value, the stronger the evidence against \(H_0\).
Some people go further and use an accept/reject framework. Under this framework, the null hypothesis \(H_0\) should be rejected if the p-value is less than 0.05 (say), and accepted if the p-value is greater than 0.05. In this course we mostly use the strength of evidence interpretation. The p-value measures how far out our observation lies in the tails of the distribution specified by \(H_0\):
In summary, p-values can indicate how incompatible the data are with a specified statistical model; they do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Note that a p-value does not measure the size of an effect or the importance of a result and by itself it does not provide a good measure of evidence regarding a model or hypothesis. Note also, that a substantial evidence of a difference does not equate to evidence of a substantial difference! Any scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold as proper inference requires full reporting and transparency.
Remember that statistical significance does not imply practical significance, and that statistical significance says nothing about the size of treatment differences. To estimate the sizes of differences you need confidence intervals, look out for these in the following chapters.
Some p-value threshold recursive FAQs
| Question | Answer |
|---|---|
| Why do so many colleges and grad schools teach p-val=0.05? | Because that’s still what the scientific community and journal editors use. BUT IT SHOULDN’T BE |
Recall, the P\(\overline{\text{a}}\)ua dataset. It contains the following variables
Age of P\(\overline{\text{a}}\)ua in years (calculated from counting rings in the cone)Length of P\(\overline{\text{a}}\)ua shell in centimetersSpecies of P\(\overline{\text{a}}\)ua: Haliotis iris (typically found in NZ) and
Haliotis australis (less commonly found in NZ)library(tidyverse)
paua <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/paua.csv")
glimpse(paua)
## Rows: 60
## Columns: 3
## $ Species <chr> "Haliotis iris", "Haliotis australis", "Haliotis australis", "…
## $ Length <dbl> 1.80, 5.40, 4.80, 5.75, 5.65, 2.80, 5.90, 3.75, 7.20, 4.25, 6.…
## $ Age <dbl> 1.497884, 11.877010, 5.416991, 4.497799, 5.500789, 2.500972, 6…
Using a violin plot we can look at the distribution of shell Length. We can calculate the average Length of all shells in our sample
paua %>% summarise(average_length = mean(Length))
## # A tibble: 1 × 1
## average_length
## <dbl>
## 1 5.19
What about drawing inference? Do we believe that the average length of P\(\overline{\text{a}}\)ua shells is, say, 5cm? We know our sample average, but can we make any claims based on this one number?
How do we reflect our uncertainty about the population mean? Remember, it’s the population we want to make inference on based on our sample!
Enter the Standard Error of the Mean (SEM), \[\text{SEM} = \frac{\sigma}{\sqrt{n}}, \] where \(\sigma = \sqrt{\frac{\Sigma_{i = 1}^n(x_i - \bar{x})^2}{n-1}}\) (\(i = 1,...,n\)) is the standard deviation (SD) of the sample, \(n\) is the sample size, and \(\bar{x}\) is the sample mean.
Calculating \(\Sigma_{i = 1}^n(x_i - \bar{x})^2, i = 1,...,n\) by hand.
In relatively plain English, this sum is the sum of the squared distances between each observation \(x_i\) and the sample mean \(\bar{x}\) (denoted \(\mu_x\) in the GIF below).
So using the example values in the GIF
## our sample of values
x <- c(1,2,3,5,6,9)
## sample mean
sample_mean <- mean(x)
sample_mean
## [1] 4.333333
## distance from mean for each value
distance_from_mean <- x - sample_mean
distance_from_mean
## [1] -3.3333333 -2.3333333 -1.3333333 0.6666667 1.6666667 4.6666667
## squared distance from mean for each value
squared_distance_from_mean <- distance_from_mean^2
squared_distance_from_mean
## [1] 11.1111111 5.4444444 1.7777778 0.4444444 2.7777778 21.7777778
## sum of the squared distances
sum(squared_distance_from_mean)
## [1] 43.33333
Calculating SD and SEM
Now what about the SD? Remember it’s \(\sqrt{\frac{\Sigma_{i = 1}^n(x_i - \bar{x})^2}{n-1}}\) = \(\sqrt{\frac{43.3333333}{6-1}}\) = \(\sqrt{\frac{43.3333333}{5}}\) = 2.9439203.
Or we could just use R’s sd() function
sd(x)
## [1] 2.94392
So the SEM is \(\frac{\text{SD}}{\sqrt{n}}\) = \(\frac{2.9439203}{\sqrt{6}}\)
In R
sd(x)/sqrt(length(x))
## [1] 1.20185
For the paua data we can simply use the in-built functions in R to calculate the SEM
sem <- paua %>% summarise(mean = mean(Length),
sem = sd(Length)/sqrt(length(Length)))
sem
## # A tibble: 1 × 2
## mean sem
## <dbl> <dbl>
## 1 5.19 0.155
Visualising the uncertainty
Recall that the SEM is a measure of uncertainty about the mean. So we can use it to express our uncertainty visually. Typically \(\pm\) twice the SEM is the interval used:
Why error bars that are \(\pm\) twice the SEM?
This is approximately the 95% confidence interval for the population mean (see lecture)
The exact 95% CI is given by \(\bar{x}\) (mean) \(\pm\) \(t_{df,1 - \alpha/2}\) \(\times\) SEM
Each mean has its own confidence interval whose width depends on the SEM for that mean
When the df (more on these later) are large (e.g., 30 or greater) and \(\alpha = 0.05\), \(t_{df,1 - \alpha/2} = t_{\text{large},0.975} \approx 2\). Hence, the 95% confidence interval for the population mean is approximately \(\bar{x}\) (mean) \(\pm\) 2 \(\times\) SEM
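For the paua data this is easy to verify directly (a sketch assuming the paua data frame read in earlier; with \(n = 60\) the exact \(t\) multiplier is very close to 2):

```r
library(tidyverse)
paua <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/paua.csv")
n <- nrow(paua)
xbar <- mean(paua$Length)
sem <- sd(paua$Length) / sqrt(n)
## approximate 95% CI: +/- 2 SEM
xbar + c(-2, 2) * sem
## exact 95% CI: +/- t multiplier x SEM
xbar + c(-1, 1) * qt(0.975, df = n - 1) * sem
```

The two intervals agree to two decimal places, which is why the \(\pm 2 \times\) SEM rule of thumb is so widely used for large samples.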
Back to our hypothesis test
Question: Do we believe that the average length of P\(\overline{\text{a}}\)ua shells is 5cm?
Formalizing into a hypothesis test:
Calculating a statistic (we use a t-statistic)
t-statistic \(= \frac{\bar{x}- \mu}{\text{SEM}}\) = \(\frac{5.1925 - 5}{0.155351}\) = 1.239
\(\bar{x}\) is the sample mean
\(\mu\) is the theoretical value (proposed mean)
The corresponding p-value
Recall that a p-value is the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value
So in this case it’s the probability, under the null hypothesis (\(\mu = 5\)), that we would observe a statistic at least as extreme as we did.
Under our null hypothesis the distribution of the t-statistic is as below. The one calculated from our hypothesis test was 1.2391. Now, remember that our alternative hypothesis was \(H_1: \mu \neq 5\) so we have to consider both sides of the inequality; hence, anything at least as extreme as our observed statistic (vertical lines) is either \(> 1.2391\) or \(< -1.2391\). This is given by the grey shaded areas.
We can calculate the p-value using the pt() function (where q is our calculated t-statistic, and df are the degrees of freedom from above):
2*(1 - pt(q = 1.2391,df = 59))
## [1] 0.2202152
Or we could do all of the above in one step using R
t.test(paua$Length, mu = 5 )
##
## One Sample t-test
##
## data: paua$Length
## t = 1.2391, df = 59, p-value = 0.2202
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 4.881643 5.503357
## sample estimates:
## mean of x
## 5.1925
Recall that the p-value gives the probability that, under our null hypothesis, we observe anything at least as extreme as what we did (hence the \(\times 2\); think of the grey shaded area in the graph). This probability is \(\sim\) 22%. Do you think what we’ve observed is likely under the null hypothesis?
Does this plot help? The proposed mean is shown by the blue horizontal line; the dashed line shows the sample mean and the dotted lines \(\pm\) the SEM.
Calculating the differences between species means:
Haliotis australis average - Haliotis iris average = \(\mu_{\text{Haliotis australis}} - \mu_{\text{Haliotis iris}}\) = 5.767 - 4.81 = 0.957. Doesn’t really tell us much…
As above the average values are all well and good, but what about variation? Recall the SEM from the one-sample t-test? The same idea holds here, although the calculation is a little bit more complicated (as we have to think about the number of observations in each group). But from the two group SEMs we can calculate the Standard Error of the Difference between two means, SED.
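As a sketch of that calculation (assuming the paua data frame from above): the SED combines the two group SEMs in quadrature, and this is the standard error the Welch two-sample t-test below uses:

```r
library(tidyverse)
paua <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/paua.csv")
## per-species means and SEMs
by_species <- paua %>%
  group_by(Species) %>%
  summarise(mean = mean(Length),
            sem = sd(Length) / sqrt(n()))
## SED: square root of the sum of the squared group SEMs
sed <- sqrt(sum(by_species$sem^2))
sed
## t-statistic for the difference in means (sign depends on the group ordering)
diff(by_species$mean) / sed
```

Dividing the observed difference in means by the SED gives the two-sample t-statistic reported by t.test() below.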
Question: Do we believe that, on average, the lengths of P\(\overline{\text{a}}\)ua shells are equal between species?
Formalizing into a hypothesis test:
Let us now calculate the test statistic: t-statistic = \(\frac{\bar{x}_{\text{difference}} - \mu}{\text{SED}}\) = \(\frac{\bar{x}_{\text{difference}} - 0}{\text{SED}}\), where \(\bar{x}_{\text{difference}}\) is the difference between the species’ averages.
Calculations are a little more tricky here so let’s use R:
test <- t.test(Length ~ Species, data = paua)
## printing out the result
test
##
## Welch Two Sample t-test
##
## data: Length by Species
## t = 3.5404, df = 57.955, p-value = 0.0007957
## alternative hypothesis: true difference in means between group Haliotis australis and group Haliotis iris is not equal to 0
## 95 percent confidence interval:
## 0.4158802 1.4980086
## sample estimates:
## mean in group Haliotis australis mean in group Haliotis iris
## 5.766667 4.809722
test$p.value
## [1] 0.0007956853
Listed are the t-statistic, t = 3.5403636, and the p-value, p-value \(= 7.96 \times 10^{-4}\), for the hypothesis test outlined above. What would you conclude?
What if we had more than two groups?
In this section we consider the following data of the calculated logAUC for 12 rats subjected to three different treatments (Surgery), C, P, and S:
| Surgery | Rat | logAUC |
|---|---|---|
| C | 1 | 8.49 |
| C | 2 | 8.20 |
| C | 3 | 9.08 |
| C | 4 | 8.07 |
| P | 1 | 10.24 |
| P | 2 | 7.72 |
| P | 3 | 9.34 |
| P | 4 | 8.50 |
| S | 1 | 11.31 |
| S | 2 | 12.69 |
| S | 3 | 11.37 |
| S | 4 | 10.82 |
To read the data directly into R you can use
rats <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/crd_rats_data.csv")
The idea: Assess distances between treatment (surgical condition) means relative to our uncertainty about the actual (true) treatment means.
Adding up the differences between each treatment mean and the overall mean: -1.192 + -0.703 + 1.895 = 0. This is always the case, so it is not a great way to measure distances!
Sums of Squares?
(-1.192)^2 + (-0.703)^2 + (1.895)^2
Add up the squared differences? But there are 4 observations in each group (treatment), so weight each squared difference by the group size:
4\(\times\)(-1.192)^2 + 4\(\times\)(-0.703)^2 + 4\(\times\)(1.895)^2
This is the Between Groups Sums of Squares or the Between group SS (SSB)
So the Between group SS (SSB) = 22.02635
Adding up the differences: -1.192 + -0.703 + 1.895 = 0. This is always the case and that itself gives us information…
We only need to know two of the values to work out the third!
So we have only 2 bits of unique information; SSB degrees of freedom = 2
The Within group SS (SSW) arises from the same idea:
To assess distances between treatment (surgical condition) means relative to our uncertainty about the actual (true) treatment means.
Procedure:
The Within group SS (SSW) represents the unexplained variance.
Recall the Between group SS (SSB) = 22.02635
So mean SSB = 22.02635 / 2
The within group SS (SSW) = 6.059075
Here we have 3\(\times\) 3 bits of unique information: within groups degrees of freedom is 9.
So mean SSW = 6.059/9
Consider the ratio \({\frac{{\text{variation due to treatments}}}{{\text{unexplained variance}}}} = {\frac{{\text{ mean between-group variability}}}{{\text{mean within-group variability}}}} = \frac{\text{mean SSB}}{\text{mean SSW}} = \frac{\text{MSB}}{\text{MSW}} = \frac{\text{experimental variance}}{\text{error variance}}\) = 16.3586975
This is the F-statistic!
Essentially statistical currency (i.e., unique bits of information). So in the example above we have 3 treatment groups and if we know the mean of two we know the third (i.e., 2 unique bits of info) so SSB df = 2.
Now, for SSW df We have 12 observations (4 in each group); we know the treatment means so if we have three of those observed values in each group we know the fourth: 12 - 3 = 9 (i.e., number of observations - number of df lost due to knowing the cell means).
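The quantities above can be reproduced in R directly from the data in the table (a hand-rolled sketch; aov() below does all of this for us):

```r
## the rat data from the table above
rats <- data.frame(
  Surgery = rep(c("C", "P", "S"), each = 4),
  logAUC = c(8.49, 8.20, 9.08, 8.07,
             10.24, 7.72, 9.34, 8.50,
             11.31, 12.69, 11.37, 10.82))
grand_mean <- mean(rats$logAUC)
group_means <- tapply(rats$logAUC, rats$Surgery, mean)
## Between group SS: 4 observations per treatment group
ssb <- sum(4 * (group_means - grand_mean)^2)
## Within group SS: squared distances of each observation from its group mean
ssw <- sum((rats$logAUC - group_means[rats$Surgery])^2)
## F-statistic: mean SSB over mean SSW, using df = 2 and df = 9
f_stat <- (ssb / 2) / (ssw / 9)
c(SSB = ssb, SSW = ssw, F = f_stat)
```

These reproduce the Sum Sq and F value entries in the aov() summary table below.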
Hypothesis: We test the null hypothesis, \(H_0\): the population (Surgery) means are all the same, versus the alternative hypothesis, \(H_1\): at least one differs from the others!
Under our null hypothesis the distribution of the F-statistic is as below. The value calculated from our hypothesis test was 16.359. This observed test statistic is shown by the vertical line. To calculate the relevant p-value we can use
(1 - pf(q = 16.351,df1 = 2, df2 = 9))
## [1] 0.001007825
Thus, the probability of getting an F-statistic at least as extreme as the one we observed (think of the area under the upper tail of the curve below), p-value Pr(>F) = 0.001, tells us we have sufficient evidence to reject \(H_0\) at the 1% level of significance.
Alternatively, we could do this in one step using aov():
summary(aov(logAUC ~ Surgery, data = rats))
## Df Sum Sq Mean Sq F value Pr(>F)
## Surgery 2 22.026 11.013 16.36 0.00101 **
## Residuals 9 6.059 0.673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In R, to specify the model you want to fit you typically create a model formula object; this is usually then passed as the first argument to the model-fitting function (e.g., lm()).
Some notes on syntax:
Consider the model formula example y ~ x + z + x:z. There is a lot going on here:
~ specifies the response; everything to its right specifies the explanatory variables
+ indicates to include both the variable to its left and the variable to its right (it does not mean they should be summed)
: denotes the interaction of the variables to its left and right
Additionally, some other symbols have special meanings in a model formula:
* means to include all main effects and interactions, so a*b is the same as a + b + a:b
^ is used to include main effects and interactions up to a specified level. For example, (a + b + c)^2 is equivalent to a + b + c + a:b + a:c + b:c (note (a + b + c)^3 would also add a:b:c)
- excludes terms that might otherwise be included. For example, -1 excludes the intercept otherwise included by default, and a*b - b would produce a + a:b
Mathematical functions can also be used directly in the model formula to transform a variable (e.g., y ~ exp(x) + log(z) + x:z). One thing that may seem counterintuitive is creating polynomial expressions (e.g., \(x^2\)). Here the expression y ~ x^2 does not square the explanatory variable x (this is due to the special formula meaning of ^ described above). To include \(x^2\) as a term in our model we have to use I() (the “as-is” operator), for example, y ~ I(x^2).
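A quick sketch with made-up data shows the difference: inside a formula ^ keeps its formula meaning, so y ~ x^2 silently collapses to y ~ x, whereas y ~ I(x^2) genuinely fits the squared term:

```r
set.seed(42)
## made-up data with a quadratic relationship
d <- data.frame(x = 1:10)
d$y <- 2 + 3 * d$x^2 + rnorm(10)
## ^ is formula syntax, so this collapses to y ~ x
names(coef(lm(y ~ x^2, data = d)))
## I() protects the expression, so x really is squared
names(coef(lm(y ~ I(x^2), data = d)))
```

The first fit’s coefficients are (Intercept) and x only; the second includes I(x^2), whose estimate recovers the quadratic coefficient (3) used to simulate the data.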
| Traditional name | Model formula | R code |
|---|---|---|
| Simple regression | \(Y \sim X_{continuous}\) | lm(Y ~ X) |
| One-way ANOVA | \(Y \sim X_{categorical}\) | lm(Y ~ X) |
| Two-way ANOVA | \(Y \sim X1_{categorical} + X2_{categorical}\) | lm(Y ~ X1 + X2) |
| ANCOVA | \(Y \sim X1_{continuous} + X2_{categorical}\) | lm(Y ~ X1 + X2) |
| Multiple regression | \(Y \sim X1_{continuous} + X2_{continuous}\) | lm(Y ~ X1 + X2) |
| Factorial ANOVA | \(Y \sim X1_{categorical} * X2_{categorical}\) | lm(Y ~ X1 * X2) or lm(Y ~ X1 + X2 + X1:X2) |
Let’s consider a linear regression with a simple explanatory variable:
\[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where
\[\epsilon_i \sim \text{Normal}(0,\sigma^2).\]
Here for observation \(i\)
Remember the penguins from the first module?
Key assumptions
library(tidyverse)
library(palmerpenguins)
penguins_nafree <- penguins %>% drop_na()
ggplot(data = penguins_nafree, aes(x = bill_depth_mm)) +
geom_histogram() + theme_classic() +
xlab("Bill depth (mm)")
First off let’s fit a null (intercept-only) model. In old money this would be called a one-sample t-test.
slm_null <- lm(bill_depth_mm ~ 1, data = penguins_nafree)
summary(slm_null)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.16486 0.1079134 159.0614 1.965076e-315
Model formula
This model, from above, is simply \[Y_i = \alpha + \epsilon_i.\]
Here for observation \(i\) \(Y_i\) is the value of the response (bill_depth_mm) and \(\alpha\) is a parameter to be estimated (typically called the intercept).
Inference
The (Intercept) term, 17.1648649, tells us the (estimated) average value of the response (bill_depth_mm), see
penguins_nafree %>% summarise(average_bill_depth = mean(bill_depth_mm))
## # A tibble: 1 × 1
## average_bill_depth
## <dbl>
## 1 17.2
The SEM (Std. Error) = 0.1079134.
The hypothesis being tested is \(H_0: \mu = 0\) vs. \(H_1: \mu \neq 0\), where \(\mu\) is the mean bill depth (i.e., the (Intercept) term).
The t-statistic is given by t value = Estimate / Std. Error = 159.0614207
The p-value is given by Pr(>|t|) \(= 1.97 \times 10^{-315}\).
So the probability of observing a t-statistic at least as extreme as ours under the null hypothesis (average bill depth = 0) is \(1.97 \times 10^{-315}\); pretty strong evidence against the null hypothesis I’d say!
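As a sanity check on simulated data (values below are made up), an intercept-only lm() reproduces the sample mean, with standard error \(\text{sd}(y)/\sqrt{n}\):

```r
## simulated data: any numeric vector would do
set.seed(2024)
y   <- rnorm(100, mean = 17, sd = 2)
fit <- lm(y ~ 1)

## the estimated intercept is the sample mean
coef(fit)[["(Intercept)"]] - mean(y)       ## essentially zero

## and its standard error is sd(y) / sqrt(n)
se <- summary(fit)$coef["(Intercept)", "Std. Error"]
se - sd(y) / sqrt(length(y))               ## essentially zero
```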
Does bill_length_mm help explain some of the variation in bill_depth_mm?
p1 <- ggplot(data = penguins_nafree, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() + ylab("Bill depth (mm)") +
xlab("Bill length (mm)") + theme_classic()
p1
slm <- lm(bill_depth_mm ~ bill_length_mm, data = penguins_nafree)
Model formula
This model is simply \[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where for observation \(i\) \(Y_i\) is the value of the response (bill_depth_mm) and \(x_i\) is the value of the explanatory variable (bill_length_mm); As above \(\alpha\) and \(\beta_1\) are parameters to be estimated. We could also write this model as
\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm}) + \epsilon \end{aligned} \]
Fitted model
As before we extract the estimated parameters (here \(\alpha\) and \(\beta_1\)) using
summary(slm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.78664867 0.85417308 24.335406 1.026904e-75
## bill_length_mm -0.08232675 0.01926835 -4.272642 2.528290e-05
Here, the (Intercept): Estimate (\(\alpha\) above) gives us the estimated average bill depth (mm) when bill length is zero (i.e., the height of the fitted line).
The bill_length_mm: Estimate (\(\beta_1\) above) is the slope associated with bill length (mm). So, here for every 1mm increase in bill length we estimate, on average, a 0.082mm decrease (or a -0.082mm increase) in bill depth.
## calculate predicted values
penguins_nafree$pred_vals <- predict(slm)
## plot
ggplot(data = penguins_nafree, aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_point() + ylab("Bill depth (mm)") +
xlab("Bill length (mm)") + theme_classic() +
geom_line(aes(y = pred_vals))
Does bill_depth_mm vary between species? Think of ANOVA.
p1 <- ggplot(data = penguins_nafree, aes(x = species, y = bill_depth_mm)) +
geom_violin() + ylab("Bill depth (mm)") +
xlab("Species") + theme_classic() + stat_summary(fun = mean, geom = "point")
p1
sflm <- lm(bill_depth_mm ~ species, data = penguins_nafree)
Model formula
This model is simply \[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where for observation \(i\), \(Y_i\) is the value of the response (bill_depth_mm) and \(x_i\) is the value of the explanatory variable (species). However, species is a factor variable. This means we need to use dummy variables:
\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{2}(\operatorname{species}_{\operatorname{Gentoo}}) + \epsilon \end{aligned} \]
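The dummy variables R creates behind the scenes can be inspected with model.matrix(); here is a small made-up factor with the same three levels:

```r
## a toy factor with the penguin species as its levels
sp <- factor(c("Adelie", "Chinstrap", "Gentoo", "Adelie"))

## one column per non-baseline level, plus the intercept
colnames(model.matrix(~ sp))  ## "(Intercept)" "spChinstrap" "spGentoo"
```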
Fitted model
The estimated parameters (here \(\alpha\), \(\beta_1\), and \(\beta_2\)) are
summary(sflm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.34726027 0.09299608 197.2906740 0.000000e+00
## speciesChinstrap 0.07332796 0.16497460 0.4444803 6.569867e-01
## speciesGentoo -3.35062162 0.13877592 -24.1441135 6.622121e-75
Here, the (Intercept): Estimate (\(\alpha\) above) gives us the estimated average bill depth (mm) of the baseline species Adelie (R chooses the baseline alphabetically; to change this see the previous chapter) and the remaining coefficients are the estimated differences between the baseline and the other categories (species). The Std. Error column gives the standard error for these estimated parameters (in the case of the group differences these are the SEDs, standard errors of the difference).
## calculate predicted values
penguins_nafree$pred_vals <- predict(sflm)
## plot
ggplot(data = penguins_nafree, aes(x = species, y = bill_depth_mm, fill = species)) +
geom_violin() + ylab("Bill depth (mm)") +
xlab("Species") + theme_classic() +
geom_hline(aes(yintercept = pred_vals, col = species), linewidth = 2) +
stat_summary(fun = mean, geom = "point")
p2 <- ggplot(data = penguins_nafree,
aes(y = bill_depth_mm, x = bill_length_mm, color = species)) +
geom_point() + ylab("Bill depth (mm)") +
xlab("Bill length (mm)") + theme_classic()
p2
slm_sp <- lm(bill_depth_mm ~ bill_length_mm + species, data = penguins_nafree)
Model formula
Now we have two explanatory variables, so our model formula becomes
\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i\] \[\epsilon_i \sim \text{Normal}(0,\sigma^2)\]
where for observation \(i\), \(Y_i\) is the value of the response (bill_depth_mm), \(z_i\) is the value of the continuous explanatory variable (bill_length_mm), and \(x_i\) represents the dummy variable(s) for species. Here, again, the Adelie group is the baseline.
So the model formula is
\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm}) + \beta_{2}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{3}(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad \epsilon \end{aligned} \]
Fitted model
summary(slm_sp)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.5652616 0.69092642 15.291442 2.977289e-40
## bill_length_mm 0.2004431 0.01767974 11.337449 2.258955e-25
## speciesChinstrap -1.9330779 0.22571878 -8.564099 4.259893e-16
## speciesGentoo -5.1033153 0.19439523 -26.252267 1.043789e-82
Simpson’s paradox… look how the slope associated with bill length (coefficient of bill_length_mm) has switched direction from the model above! Why do you think this is?
Here, the (Intercept): Estimate gives us the estimated average bill depth (mm) of the Adelie penguins given the estimated relationship between bill length and bill depth. Technically, this is the estimated bill depth (mm) for Adelie penguins with zero bill length. That is clearly a nonsense interpretation, as it is an impossible situation in practice! I would recommend thinking of this as the y-shift (i.e., height) of the fitted line.
The bill_length_mm: Estimate (\(\beta_1\) above) is the slope associated with bill length (mm). So, here for every 1mm increase in bill length we estimate, on average, a 0.2mm increase in bill depth.
What about the coefficient of the other species levels? Look at the plot below, these values give the shift (up or down) of the parallel lines from the Adelie level. So given the estimated relationship between bill depth and bill length these coefficients are the estimated change from the baseline.
## calculate predicted values
penguins_nafree$pred_vals <- predict(slm_sp)
## plot
ggplot(data = penguins_nafree, aes(y = bill_depth_mm, x = bill_length_mm, color = species)) +
geom_point() + ylab("Bill depth (mm)") +
xlab("Bill length (mm)") + theme_classic() +
geom_line(aes(y = pred_vals))
Recall the (additive) model formula from above
\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i\]
but what about interactions between variables? For example,
\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \beta_3z_ix_i + \epsilon_i\]
Note: we can include interaction effects in our model using either the * or : syntax in the model formula. For example,
: denotes the interaction of the variables to its left and right, and
* means to include all main effects and interactions, so a*b is the same as a + b + a:b.
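The equivalence of the two syntaxes can be checked on simulated data (purely illustrative): the two fits are identical.

```r
## toy data: one continuous and one factor explanatory variable
set.seed(11)
d <- data.frame(x = runif(60),
                g = factor(rep(c("A", "B", "C"), each = 20)))
d$y <- 1 + 2 * d$x + as.numeric(d$g) + rnorm(60)

m1 <- lm(y ~ x * g, data = d)          ## shorthand
m2 <- lm(y ~ x + g + x:g, data = d)    ## spelled out
all.equal(fitted(m1), fitted(m2))      ## TRUE
```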
To specify a model with additive and interaction effects we use
slm_int <- lm(bill_depth_mm ~ bill_length_mm*species, data = penguins_nafree)
Model formula
The model formula is then
\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm}) + \beta_{2}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{3}(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad \beta_{4}(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{5}(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Gentoo}}) + \epsilon \end{aligned} \]
Fitted model
summary(slm_int)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.48770713 1.15987305 9.9042797 2.135979e-20
## bill_length_mm 0.17668344 0.02980564 5.9278518 7.793199e-09
## speciesChinstrap -3.91856701 2.06730876 -1.8954919 5.890889e-02
## speciesGentoo -6.36675118 1.77989710 -3.5770333 4.000274e-04
## bill_length_mm:speciesChinstrap 0.04552828 0.04594283 0.9909769 3.224296e-01
## bill_length_mm:speciesGentoo 0.03092816 0.04111608 0.7522157 4.524625e-01
This can be written out:
\[ \begin{aligned} \operatorname{\widehat{bill\_depth\_mm}} &= 11.49 + 0.18(\operatorname{bill\_length\_mm}) - 3.92(\operatorname{species}_{\operatorname{Chinstrap}}) - 6.37(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad 0.05(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Chinstrap}}) + 0.03(\operatorname{bill\_length\_mm} \times \operatorname{species}_{\operatorname{Gentoo}}) \end{aligned} \]
The interaction terms (i.e., bill_length_mm:speciesChinstrap and bill_length_mm:speciesGentoo) specify the species specific slopes given the other variables in the model. We can use our dummy variable trick again to interpret the coefficients correctly. In this instance we have one factor explanatory variable, Species, and one continuous explanatory variable, bill length (mm). As before, Adelie is our baseline reference.
Let’s assume we’re talking about the Adelie penguins, then our equation becomes (using the dummy variable technique)
\[\widehat{\text{bill_depth_mm}} = 11.488 + (0.177 \times \text{bill_length_mm}).\]
So, the (Intercept): Estimate term (\(\hat{\alpha}\)), again, specifies the height of the Adelie fitted line and the main effect of bill_length_mm: Estimate (\(\hat{\beta_1}\)) estimates the relationship (slope) between bill length (mm) and bill depth (mm) for the Adelie penguin. So, here for every 1mm increase in bill length (mm) for the Adelie penguins we estimate, on average, a 0.177mm increase in bill depth (mm).
Now, what about Gentoo penguins? Our equation then becomes
\[\widehat{\text{bill_depth_mm}} = 11.488 + (0.177 \times \text{bill_length_mm}) + (-6.367) + (0.031 \times \text{bill_length_mm}).\]
which simplifies to
\[\widehat{\text{bill_depth_mm}} = 5.121 + (0.208 \times \text{bill_length_mm}).\]
The estimated Gentoo-specific intercept term (y-axis line height) is therefore \(\hat{\alpha} + \hat{\beta_3} = 11.488 + (-6.367) = 5.121.\) The Gentoo-specific bill length (mm) slope is then \(\hat{\beta_1} + \hat{\beta_5} = 0.177 + 0.031 = 0.208.\) So, for every 1mm increase in bill length (mm) for the Gentoo penguins we estimate, on average, a 0.208mm increase in bill depth (mm), a slightly steeper slope than for the estimated Adelie relationship (i.e., we estimate that as Gentoo bills get longer their depth increases at a, slightly, greater rate than those of Adelie penguins).
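The same arithmetic can be done directly with the (unrounded) coefficients printed by summary(slm_int) above:

```r
## coefficients copied from the summary(slm_int) output above
alpha <- 11.48770713   ## (Intercept)
beta1 <- 0.17668344    ## bill_length_mm
beta3 <- -6.36675118   ## speciesGentoo
beta5 <- 0.03092816    ## bill_length_mm:speciesGentoo

alpha + beta3   ## Gentoo-specific intercept, ~5.121
beta1 + beta5   ## Gentoo-specific slope, ~0.208
```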
In summary, the main effect of species (i.e., speciesChinstrap: Estimate and speciesGentoo:Estimate ) again give the shift (up or down) of the lines from the Adelie level. However, these lines are no longer parallel (i.e., each species of penguin has a different estimated relationship between bill length and bill depth)!
But, now we’ve specified this all singing and dancing interaction model we might ask are the non-parallel lines non-parallel enough to reject the parallel line model? Look at the plot below; what do you think?
Other possible models
Let’s assume that we have the same data types as in the previous section, specifically, a continuous response (\(y\)) and one factor (\(f\), with two levels \(A\) and \(B\)) and one continuous (\(x\)) explanatory variable. Assuming that \(A\) is the baseline for \(f\) the possible models are depicted below.
Note that models III and V are forced to have the same intercept for both levels of \(f\). In addition, when you have no main effect of \(x\), models IV and V, then the model is forced to have no effect of \(x\) for the baseline level of \(f\) (in this case \(A\)).
Remember that it is always imperative that we check the underlying assumptions of our model! If our assumptions are not met then, basically, the maths falls over and we can’t reliably draw inference from the model (e.g., we can’t trust the parameter estimates). Two of the most important assumptions are:
equal variances (homogeneity of variance), and
normality of residuals.
Let’s look at the fit of the slm_sp model (continuous explanatory variable plus species)
gglm::gglm(slm_sp) # Plot the four main diagnostic plots
Residuals vs Fitted plot
You are basically looking for no pattern or structure in your residuals (e.g., a “starry night”). What you definitely don’t want to see is the scatter increasing around the zero line (dashed line) as the fitted values get bigger (think of a trumpet, a wedge of cheese, or even a slice of pizza), which would indicate unequal variances (heteroscedasticity).
Normal quantile-quantile (QQ) plot
This plot shows the sorted residuals versus the expected order statistics from a standard normal distribution. Points should lie close to the line; points moving away from the 45 degree line at the tails suggest the residuals come from a skewed distribution.
Scale-Location plot (\(\sqrt{|\text{standardized residuals}|}\) vs fitted values)
Another way to check the homoskedasticity (constant-variance) assumption. We want the line to be roughly horizontal. If this is the case then the average magnitude of the standardized residuals isn’t changing much as a function of the fitted values. We’d also like the spread around the line not to vary much with the fitted values; then the variability of magnitudes doesn’t vary much as a function of the fitted values.
Residuals vs Leverage plot (standardized residuals vs Leverage)
This can help detect outliers in a linear regression model. For a linear regression model, leverage measures how sensitive a fitted value is to a change in the corresponding response. We’re looking at how the spread of the standardized residuals changes with leverage. This plot can also be used to detect heteroskedasticity and non-linearity: the spread of standardized residuals shouldn’t change as a function of leverage. In addition, points with high leverage may be influential: that is, deleting them would change the model a lot.
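If you prefer base R, plot() on an lm object draws the same four panels; which = c(1, 2, 3, 5) selects residuals vs fitted, QQ, scale-location, and residuals vs leverage. The model below is simulated, since only the mechanics are being shown:

```r
## a simulated stand-in model (illustrative only)
set.seed(42)
d   <- data.frame(x = runif(100))
d$y <- 1 + 2 * d$x + rnorm(100)
fit <- lm(y ~ x, data = d)

par(mfrow = c(2, 2))                ## 2 x 2 grid of panels
plot(fit, which = c(1, 2, 3, 5))    ## the four diagnostic plots
```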
Do you think the residuals are Normally distributed (look at the QQ plot)? Think of what this model is, do you think it’s the best we can do?
If the residuals do not show a violation of our model assumptions (let’s just pretend the plots above don’t show anything for concern) we can produce margins plots, which show the marginal effect of a predictor, holding other variables constant.
The ggpredict() function from the ggeffects package can be used for this:
ggeffects::ggpredict(slm_sp, terms = "species")
## # Predicted values of bill_depth_mm
##
## species | Predicted | 95% CI
## ------------------------------------
## Adelie | 19.38 | 19.15, 19.62
## Chinstrap | 17.45 | 17.17, 17.73
## Gentoo | 14.28 | 14.07, 14.49
##
## Adjusted for:
## * bill_length_mm = 43.99
We can even plot the marginal effects:
ggeffects::ggpredict(slm_sp, terms = "species") %>%
ggplot(aes(x = x, y = predicted, label = round(predicted, 3))) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .5) +
geom_text(hjust = -.2) +
ylab( "Predicted bill depth (mm)") + xlab("") +
ggtitle("Fitted means for each species with 95% CIs", subtitle = "Adjusted for bill_length_mm = 43.99") +
theme_classic()
Are the non-parallel lines non-parallel enough to reject the parallel line model?
Now we can compare nested linear models by hypothesis testing. Luckily the R function anova() automates this. Yes the same idea as we’ve previously learnt about ANOVA! We essentially perform an F-ratio test between the nested models!
By nested we mean that one model is a subset of the other (i.e., where some coefficients have been fixed at zero). For example,
\[Y_i = \beta_0 + \beta_1z_i + \epsilon_i\]
is a nested version of
\[Y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i\] where \(\beta_2\) has been fixed to zero.
As an example consider testing the single explanatory variable model slm against the same model with species included as a variable slm_sp. To carry out the appropriate hypothesis test in R we can run
anova(slm,slm_sp)
## Analysis of Variance Table
##
## Model 1: bill_depth_mm ~ bill_length_mm
## Model 2: bill_depth_mm ~ bill_length_mm + species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 331 1220.16
## 2 329 299.62 2 920.55 505.41 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you’ll see the anova() function takes the two model objects (slm and slm_sp) as arguments. It returns an ANOVA table testing whether the simpler model (slm) captures the variation in the data as well as the more complex model (slm_sp). The returned p-value should be interpreted as in any other hypothesis test, i.e., the probability of observing a statistic at least as extreme under our null hypothesis (here, that each model captures the variation in the data equally well).
What would we conclude here? I’d say we have pretty strong evidence against the models being equally good! I’d definitely plump for slm_sp over slm, looking back at the plots above does this make sense?
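The F statistic in the table can be reproduced by hand from the printed residual sums of squares and degrees of freedom:

```r
## values copied from the anova(slm, slm_sp) table above
rss1 <- 1220.16; df1 <- 331   ## slm
rss2 <- 299.62;  df2 <- 329   ## slm_sp

## F = (change in RSS / change in df) / (RSS of larger model / its df)
f <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
f   ## ~505.4, matching the table (up to rounding)
```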
Now what about slm_int vs slm_sp?
anova(slm_sp,slm_int)
## Analysis of Variance Table
##
## Model 1: bill_depth_mm ~ bill_length_mm + species
## Model 2: bill_depth_mm ~ bill_length_mm * species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 329 299.62
## 2 327 298.62 2 0.99284 0.5436 0.5812
So it seems both models are just as good at capturing the variation in our data: we’re happy with the parallel lines!
Another way we might compare models is by using the Akaike information criterion (AIC) (you’ll see more of this later in the course). AIC is an estimator of out-of-sample prediction error and can be used as a metric to choose between competing models; we’re looking for the smallest AIC (i.e., the smallest estimated out-of-sample prediction error). Typically, a difference of 4 or more is taken to indicate an improvement; this rule of thumb should not be taken as writ, however, and using multiple model-comparison techniques is advised.
R already has an AIC() function that can be used directly on your lm() model object(s). For example,
AIC(slm,slm_sp,slm_int)
## df AIC
## slm 3 1383.4462
## slm_sp 5 919.8347
## slm_int 7 922.7294
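For an lm object, AIC() computes \(-2\log L + 2k\), where \(k\) is the number of estimated parameters (including the error variance); we can verify this identity on a simulated fit:

```r
## a simulated fit (illustrative only)
set.seed(1)
d   <- data.frame(x = runif(50))
d$y <- 1 + d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

ll <- logLik(fit)
k  <- attr(ll, "df")   ## 3: intercept, slope, and error variance
all.equal(AIC(fit), -2 * as.numeric(ll) + 2 * k)   ## TRUE
```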
This backs up what our ANOVA suggested: model slm_sp is preferred! As always it’s important to do a sanity check! Does this make sense? Have a look at the outputs from these models and see what you think.
Just because we’ve chosen a model (the best of a bad bunch perhaps) this doesn’t let us off the hook. We should still check our assumptions:
gglm::gglm(slm_sp) # Plot the four main diagnostic plots
Residuals vs Fitted plot: equal spread? Doesn’t look too trumpety!
Normal quantile-quantile (QQ) plot: skewed? Maybe slightly right skewed (deviation upwards from the right tail)
Scale-Location plot: equal spread? I’d say so.
Residuals vs Leverage: ? Maybe a couple of points with high leverage.
After all that what do estimated parameters mean? Assuming our model assumptions are met we can draw inference based on the estimated parameter values.
Using the slm_sp model we can make a point prediction for the expected bill depth (mm) for Gentoo penguins with a bill length of 50mm.
Recall the model equation is
\[ \begin{aligned} \operatorname{bill\_depth\_mm} &= \alpha + \beta_{1}(\operatorname{bill\_length\_mm})\ + \\ &\quad \beta_{2}(\operatorname{species}_{\operatorname{Chinstrap}}) + \beta_{3}(\operatorname{species}_{\operatorname{Gentoo}})\ + \\ &\quad \epsilon \end{aligned} \]
The fitted equation is given as
\[ \begin{aligned} \operatorname{\widehat{bill\_depth\_mm}} &= 10.57 + 0.2(\operatorname{bill\_length\_mm})\ - \\ &\quad 1.93(\operatorname{species}_{\operatorname{Chinstrap}}) - 5.1(\operatorname{species}_{\operatorname{Gentoo}}) \end{aligned} \]
We can then simply substitute in the values:
\[\widehat{\text{bill depth}} = \hat{\alpha} + \hat{\beta_1}*50 + \hat{\beta_3}*1\] \[\downarrow\]
\[\widehat{\text{bill depth}} = 10.57 + 0.20*50 - 5.10*1\] \[\downarrow\]
\[15.47\text{mm}\]
Rather than by hand we can do this easily in R
## create new data frame with data we want to predict to
## the names have to match those in our original data frame
newdata <- data.frame(species = "Gentoo",bill_length_mm = 50)
## use predict() function
predict(slm_sp, newdata = newdata) ## more accurate than our by hand version!
## 1
## 15.4841
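Using the unrounded coefficients from the summary(slm_sp) output above, the by-hand calculation agrees with predict() to four decimal places:

```r
## coefficients copied from the summary(slm_sp) output above
b0 <- 10.5652616   ## (Intercept)
b1 <- 0.2004431    ## bill_length_mm
b3 <- -5.1033153   ## speciesGentoo

b0 + b1 * 50 + b3   ## ~15.4841, as returned by predict()
```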
What does this look like on a plot?
For the chosen slm_sp model we can get these simply by using confint(). By default the 95% intervals are returned:
cis <- confint(slm_sp)
cis
## 2.5 % 97.5 %
## (Intercept) 9.2060707 11.9244526
## bill_length_mm 0.1656635 0.2352227
## speciesChinstrap -2.3771120 -1.4890438
## speciesGentoo -5.4857298 -4.7209009
So this tells us that for every 1mm increase in bill length we estimate the expected bill depth to increase by between 0.166 and 0.235 mm.
We estimate that the expected bill depth of a Chinstrap penguin is between 1.5 and 2.4 mm shallower than that of an Adelie penguin.
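Each interval is just estimate ± t-quantile × SE. For example, for bill_length_mm, using the residual degrees of freedom of the slm_sp fit (329, from the ANOVA table above):

```r
## values copied from the summary(slm_sp) output above
est <- 0.2004431     ## bill_length_mm estimate
se  <- 0.01767974    ## its standard error

est + c(-1, 1) * qt(0.975, df = 329) * se   ## ~0.1657 0.2352
```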
Learning objectives
fit a model in R with one categorical explanatory variable (one-way ANOVA) and draw the appropriate inference
fit a model in R with two categorical explanatory variables and an interaction (two-way ANOVA with interaction) and draw the appropriate inference
fit models in R using aov(), lm(), and lmer() and discuss and compare the three
write R code to visualise repeated measures data
fit a model to repeated measures data in R and draw the appropriate inference
Other resources
“A useful property of a test of significance is that it exerts a sobering influence on the type of experimenter who jumps to conclusions on scanty data, and who might otherwise try to make everyone excited about some sensational treatment effect that can well be ascribed to the ordinary variation in [their] experiment.”
An experiment is a procedure (or set of actions) where a researcher intentionally changes some factor/treatment/variable to observe the effect of their actions. As mentioned above, the collection of observational data is not experimentation.
An experimental unit is the smallest portion of experimental material which is independently perturbed. This is the item under study for which some variable (treatment) is changed. For example this could be a human subject or an agricultural plot.
An observational unit (or subsample) is the smallest unit on which a response is measured. If the experimental unit is split after the treatment has been applied (e.g., multiple samples taken from one human subject) then this sample is called a subsample or observational unit. If one measurement is made on each experimental unit then the observational unit = the experimental unit. If multiple measurements are made on each subject (e.g., human) then each experimental unit has >1 observational unit. This is then pseudo- or technical replication (see below).
A treatment (or independent variable or factor or treatment factor) is an experimental condition independently applied to an experimental unit. It is one of the variables that is controlled by the researcher during the experiment (e.g., drug type). The values of the treatments within a set are called levels. The dependent variable or response is the output (or thing) that is measured after an experiment. This is what the researcher measures to assess whether changing the treatment(s) (i.e., independent variable(s)) induces any change. An effect is the change in the response variable caused by the controlled changes in the independent variable. Whether the magnitude of the effect (its size) is significant (or of any practical interest) is determined by the researcher after carrying out some appropriate analyses.
R. A. Fisher’s work in the area of experimental design is, perhaps, the most well known in the field. The principles he devised we still abide by today; see below. Note, however, for a balanced view of the celebrated mathematician: many of his views (on eugenics and race in particular) are abhorrent. I would urge you to read up on his work in this area and come to your own conclusions.
Biological replication: each treatment is independently applied to each of several humans, animals or plants. Why? To generalize results to population.
Technical replication: two or more samples from the same biological source which are independently processed. Why? Advantageous if processing steps introduce a lot of variation; increases the precision with which comparisons of relative abundances between treatments are made.
Pseudo-replication: one sample from the same biological source, divided into two or more aliquots which are independently measured. Why? Advantageous for noisy measuring instruments; increases the precision with which comparisons of relative abundances between treatments are made.
The main reason to randomise allocation of treatment to experimental units is to protect against bias. We, typically, wish to plan the experiment in such a way that the variations caused by extraneous factors can all be combined under the general heading of “chance”. Doing so ensures that each treatment has the same probability of getting good (or bad) units and thus avoids systematic bias. Random allocation can cancel out population bias; it ensures that any other possible causes for the experimental results are split equally between groups. Typically statistical analysis assumes that observations are independent. This is almost never strictly true in practice but randomisation means that our estimates will behave as if they were based on independent observations.
Blocking helps control variability by making treatment groups more alike. Experimental units are divided into subsets (called blocks) so that units within the same block are more similar than units from different subsets or blocks. Blocking is a technique for dealing with nuisance factors. A nuisance factor is a factor that has some effect on the response, but is of no interest.
When setting up an experiment and/or considering experimental data we should always answer the following questions.
Last Christmas I was given a gift set of three types of coffee beans. I want to know which makes the darkest coffee; to do this I measure the opacity after the coffee is made.
Experiment design:
There is one treatment factor (type of coffee bean) with 3 levels: Arabica, Liberica, and Robusta.
In this case the experimental unit would be the coffee cup as each one is allocated a different bean type (treatment).
The observational unit changes depending on the scenario (i.e., on what samples are taken and how).
Let’s consider a completely randomised design with one treatment factor (e.g., coffee bean type). Here, \(n\) experimental units (e.g., cups) are divided randomly into \(t\) groups. Random allocation can be achieved by simply drawing lots from a hat! To be more rigorous, though, we could use R’s sample() function (have a go yourself and see if you can work out how to wield sample()). Each group is then given one treatment level. As we have defined only one treatment factor, all other known independent variables are kept constant so as to not bias any effects (e.g., cup type in Charlotte’s ☕ experiment).
An illustration of a CRD with one treatment factor and three treatment levels (A, B, & C)
Let’s assume you want to set up an experiment similar to the coffee one above, using R to do the random allocation of treatments for you.
## create a character vector of bean types
beans <- rep(c("Arabica","Liberica", "Robusta"), each = 4)
beans
## [1] "Arabica" "Arabica" "Arabica" "Arabica" "Liberica" "Liberica"
## [7] "Liberica" "Liberica" "Robusta" "Robusta" "Robusta" "Robusta"
## randomly sample the character vector to give the order of coffees
set.seed(1234) ## this is ONLY for consistency, remove if doing this yourself
allocation <- sample(beans, 12)
allocation
## [1] "Robusta" "Robusta" "Liberica" "Liberica" "Arabica" "Liberica"
## [7] "Arabica" "Robusta" "Arabica" "Liberica" "Robusta" "Arabica"
Having run the code above your CRD plan is as follows
| Cup | Bean |
|---|---|
| 1 | Robusta |
| 2 | Robusta |
| 3 | Liberica |
| 4 | Liberica |
| 5 | Arabica |
| 6 | Liberica |
| 7 | Arabica |
| 8 | Robusta |
| 9 | Arabica |
| 10 | Liberica |
| 11 | Robusta |
| 12 | Arabica |
Let’s consider a randomised complete block design with one treatment factor (e.g., coffee bean type). If the treatment factor has \(t\) levels there will be \(b\) blocks that each contain \(t\) experimental units resulting in a total of \(t\times b\) experimental units. For example, let’s imagine that for the coffee experiment we had two cup types: mugs and heatproof glasses. We might consider the type of receptacle to have an effect on the coffee colour measured, however, we are not interested in this. Therefore, to negate this we block by cup type. This means that any effect due to the blocking factor (cup type) is accounted for by the blocking.
For a blocked design we want the \(t\) experimental units within each block to be as homogeneous as possible (as similar as possible, so that there is unlikely to be unwanted variation coming into the experiment this way). The variation between blocks (the groups of experimental units) should be large enough (i.e., the blocking factor levels different enough) so that meaningful conclusions can be drawn. Allocation of treatments to experimental units is done randomly within each block.
An illustration of an RCBD with one treatment factor, three treatment levels (A, B, & C), and three blocks (rows)
Let’s assume you want to set up an experiment similar to the coffee one above; however, now you are in the situation where you have two types of cups (mugs and heatproof glasses). Below we use R to do the random allocation of treatments within each block (cup type) for you.
Here we have \(t = 3\) treatments (bean types) and \(b = 2\) blocks (cup types) so we will have \(t \times b = 6\) experimental units in total.
set.seed(432) ## this is ONLY for consistency, remove if doing this yourself
plan <- data.frame(Beans = rep(c("Arabica","Liberica", "Robusta"), times = 2),
Block = rep(c("Mug", "Glass"), each = 3)) %>% ## combine experiment variables
group_by(Block) %>% ## group by blocking factor
dplyr::sample_n(3)
plan
## # A tibble: 6 × 2
## # Groups: Block [2]
## Beans Block
## <chr> <chr>
## 1 Robusta Glass
## 2 Arabica Glass
## 3 Liberica Glass
## 4 Robusta Mug
## 5 Arabica Mug
## 6 Liberica Mug
Having run the code above your RCBD plan is as follows
| Cup | Beans | Block |
|---|---|---|
| 1 | Robusta | Glass |
| 2 | Arabica | Glass |
| 3 | Liberica | Glass |
| 1 | Robusta | Mug |
| 2 | Arabica | Mug |
| 3 | Liberica | Mug |
A factorial experiment is one where there are two or more sets of (factor) treatments. Rather than studying each factor separately all combinations of the treatment factors are considered. Factorial designs enable us to infer any interaction effects, which may be present. An interaction effect is one where the effect of one variable depends on the value of another variable (i.e., the effect of one treatment factor on the response variable will change depending on the value of a second treatment factor.)
For example, we could add another treatment to the coffee experiment above: grinder type (manual or electric). So, now we have two treatment factors, bean type (with three levels) and grinder type (with two levels). Recall that in a factorial experiment we want to study all combinations of the levels of each factor:
| Manual | Electric | |
|---|---|---|
| Arabica | A.M | A.E |
| Liberica | L.M | L.E |
| Robusta | R.M | R.E |
Note: if a factorial design has equal numbers of replicates in each group then it is said to be a balanced design; if this is not the case then it is unbalanced.
Now, our question about the colour of the coffee would be: Does the difference in the colour of the coffee between the bean types depend on the grinder used?
Possible outcomes
If an interaction effect exists the effect of one factor on the response will change depending on the level of the other factor:
The plot above is called an interaction plot (think back to the previous module). Creating such a plot is often very useful when drawing inference; in this instance we can see that the colour of the coffee changes depending on the type of coffee bean used; however, this relationship differs depending on the type of grinder used. For example, Liberica beans produce darker coffee than the other two beans when the electric grinder is used, but lighter coffee when the manual grinder is used.
As we’ve seen in the previous module, we can write a linear model with a single explanatory variable as
\[Y_i = \alpha + \beta_1x_i + \epsilon_i\]
When dealing with factor variables we use dummy variables and can write the above as
\[Y_{ik} = \alpha + \tau_k + \epsilon_{ik}\] where \(\tau_k\) is called an effect and represents the difference between the overall average, \(\alpha\), and the average at the \(k^{th}\) treatment level. The errors \(\epsilon_{ik}\) are again assumed to be normally distributed and independent due to the randomisation (i.e., \(\epsilon_{ik} \sim N(0, \sigma^2)\)).
Or you might think of the model as
\[Y_{ik} = \mu_k + \epsilon_{ik}\]
where \(Y_{ik}\) is the response (i.e., observed coffee opacity) for the \(i^{th}\) experimental unit (i.e., coffee cup) subjected to the \(k^{th}\) level of the treatment factor (i.e., coffee type). Here \(\mu_k\) are the different (cell) means for each level of the treatment factor. See below for an illustration of this for three factor treatment levels (as in the coffee example above).
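To see the dummy variable coding R uses behind the scenes, we can inspect the design matrix for a small hypothetical three-level factor (a sketch; the column names depend on the factor's levels):

```r
## a hypothetical treatment factor with three levels
treatment <- factor(c("A", "A", "B", "B", "C", "C"))
## the design matrix: an intercept column plus one dummy column
## per non-baseline level (the first level is the baseline)
model.matrix(~ treatment)
```

The baseline (first) level is absorbed into the intercept, which is why only two dummy columns appear for a three-level factor.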
In this section we again consider the data containing logAUC values for 12 rats subjected to three different treatments (Surgery): C, P, and S:
| Surgery | Rat | logAUC |
|---|---|---|
| C | 1 | 8.49 |
| C | 2 | 8.20 |
| C | 3 | 9.08 |
| C | 4 | 8.07 |
| P | 1 | 10.24 |
| P | 2 | 7.72 |
| P | 3 | 9.34 |
| P | 4 | 8.50 |
| S | 1 | 11.31 |
| S | 2 | 12.69 |
| S | 3 | 11.37 |
| S | 4 | 10.82 |
To read the data directly into R you can use
rats <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/crd_rats_data.csv")
We could analyse these data using aov():
rats_aov <- aov(logAUC ~ Surgery, data = rats)
summary(rats_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Surgery 2 22.026 11.013 16.36 0.00101 **
## Residuals 9 6.059 0.673
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Hypothesis: We test the null hypothesis, \(H_0\), that the population (Surgery) means are all equal, versus the alternative hypothesis, \(H_1\), that at least one differs from the others.
The p-value, Pr(>F) = 0.001, is the probability of getting an F-statistic at least as extreme as the one we observe (think of the area under the tails of the curve below). It tells us we have sufficient evidence to reject \(H_0\) at the 1% level of significance.
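The tail probability can be computed directly from the F-statistic and its degrees of freedom, using the values reported in the ANOVA table above:

```r
## Pr(F >= 16.359) with 2 and 9 degrees of freedom
pf(16.359, df1 = 2, df2 = 9, lower.tail = FALSE)
```

This agrees with the Pr(>F) value of roughly 0.001 printed by summary(rats_aov).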
Alternatively, we could use lm():
rats_lm <- lm(logAUC ~ Surgery, data = rats)
summary(rats_lm)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.4600 0.4102531 20.6214144 6.930903e-09
## SurgeryP 0.4900 0.5801856 0.8445574 4.202408e-01
## SurgeryS 3.0875 0.5801856 5.3215734 4.799872e-04
So, what does this tell us about which pairs of means are different?
To carry out pair-wise comparisons of means we can use two-sample t-tests, calculating our observed t-value where \(\text{t-value} = \frac{\text{Sample Difference}_{ij} - \text{Difference assuming } H_0 \text{ is true}_{ij}}{\text{SE of } \text{Sample Difference}_{ij}}\). Here, \(\text{Sample Difference}_{ij}\) is the difference between a pair of sample means. We can then compute the p-value for the observed t-value.
The output above has already done this for us:
(Intercept) = \(\text{mean}_C\) = 8.46
SE of (Intercept) = SE of \(\text{mean}_C\) = SEM = 0.4102531
\(\text{Surgery}_P\) = \(\text{mean}_P\) – \(\text{mean}_C\) = 0.49
SE of \(\text{Surgery}_P\) = SE of (\(\text{mean}_P\) - \(\text{mean}_C\) ) = SED = 0.5801856
Hypotheses being tested
The t value and Pr(>|t|) columns give the t-statistic and p-value for testing the null hypotheses listed below
1. Mean logAUC is zero for the C population
2. No difference between the population means of P and C
3. No difference between the population means of S and C
We’re interested in 2 and 3, but not necessarily 1!
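As a check, the SurgeryP row of the summary above can be reproduced by hand from the quantities just listed:

```r
diff_PC <- 0.49       ## mean_P - mean_C, from the output above
sed <- 0.5801856      ## SE of the difference (SED)
t_val <- diff_PC / sed
t_val                 ## matches the reported t value
## two-sided p-value on the residual degrees of freedom (12 - 3 = 9)
2 * pt(abs(t_val), df = 9, lower.tail = FALSE)
```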
F-test:
anova(rats_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Surgery 2 22.0263 11.0132 16.359 0.001006 **
## Residuals 9 6.0591 0.6732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The same as aov(); in fact, aov() calls lm() in the background.
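You can see this inheritance directly: a fitted aov object is also an lm object (illustrated here with the built-in iris data):

```r
## aov() returns an object that also inherits from class "lm"
fit <- aov(Sepal.Length ~ Species, data = iris)
class(fit)           ## "aov" "lm"
inherits(fit, "lm")  ## TRUE
```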
Recall that when carrying out any linear regression we make some key assumptions:
gglm::gglm(rats_lm) # Plot the four main diagnostic plots
What do you think? Look back at module 2.
Data: Global metabolic profiling and comparison of relative abundances of proteins (logAUC) in the inner and outer left ventricle wall (innerLV and outerLV) of diabetic and healthy male Wistar rats. To read the data directly into R you can use
factorial <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/factorial_expt.csv")
| Organ | Diabetic | Healthy |
|---|---|---|
| innerLV | n = 3 | n = 3 |
| outerLV | n = 3 | n = 3 |
| Disease | Organ | Animal | Sample | logAUC |
|---|---|---|---|---|
| Healthy | innerLV | 1 | 1 | 9.40 |
| Healthy | outerLV | 2 | 2 | 8.83 |
| Healthy | innerLV | 3 | 1 | 10.33 |
| Healthy | outerLV | 4 | 2 | 10.49 |
| Healthy | innerLV | 5 | 1 | 9.74 |
| Healthy | outerLV | 6 | 2 | 10.98 |
| Diabetic | innerLV | 7 | 1 | 7.92 |
| Diabetic | outerLV | 8 | 2 | 9.37 |
| Diabetic | innerLV | 9 | 1 | 8.69 |
| Diabetic | outerLV | 10 | 2 | 11.31 |
| Diabetic | innerLV | 11 | 1 | 7.01 |
| Diabetic | outerLV | 12 | 2 | 9.29 |
Fitting models with interactions using lm()
## change to factors (saves errors later on)
factorial$Disease <- as.factor(factorial$Disease)
factorial$Organ <- as.factor(factorial$Organ)
fac_lm <- lm(logAUC ~ Disease*Organ, data = factorial)
summary(fac_lm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.873333 0.5405835 14.564508 4.841215e-07
## DiseaseHealthy 1.950000 0.7645006 2.550685 3.413826e-02
## OrganouterLV 2.116667 0.7645006 2.768692 2.434579e-02
## DiseaseHealthy:OrganouterLV -1.840000 1.0811671 -1.701865 1.271934e-01
So, the full model is
\[ \begin{aligned} \operatorname{\widehat{logAUC}} &= 7.87 + 1.95(\operatorname{Disease}_{\operatorname{Healthy}}) + 2.12(\operatorname{Organ}_{\operatorname{outerLV}}) - 1.84(\operatorname{Disease}_{\operatorname{Healthy}} \times \operatorname{Organ}_{\operatorname{outerLV}}) \end{aligned} \]
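Summing the relevant coefficients recovers the four fitted cell means (values copied, rounded, from the summary output above):

```r
## coefficient estimates from summary(fac_lm), rounded
b0  <- 7.873   ## (Intercept): Diabetic, innerLV
bH  <- 1.950   ## DiseaseHealthy
bO  <- 2.117   ## OrganouterLV
bHO <- -1.840  ## DiseaseHealthy:OrganouterLV
## add the coefficients that "switch on" for each treatment combination
c(Diabetic_innerLV = b0,
  Diabetic_outerLV = b0 + bO,
  Healthy_innerLV  = b0 + bH,
  Healthy_outerLV  = b0 + bH + bO + bHO)
```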
The three global null hypotheses being tested are: 1) no Disease main effect, 2) no Organ main effect, and 3) no Disease-by-Organ interaction.
anova(fac_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 3.1827 3.1827 3.6304 0.09320 .
## Organ 1 4.2960 4.2960 4.9003 0.05775 .
## Disease:Organ 1 2.5392 2.5392 2.8963 0.12719
## Residuals 8 7.0135 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What conclusions do you draw?
Note that with a balanced design the ordering of terms doesn’t matter. For example,
fac_lm <- lm(logAUC ~ Disease*Organ, data = factorial)
anova(fac_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 3.1827 3.1827 3.6304 0.09320 .
## Organ 1 4.2960 4.2960 4.9003 0.05775 .
## Disease:Organ 1 2.5392 2.5392 2.8963 0.12719
## Residuals 8 7.0135 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fac_lm_2 <- lm(logAUC ~ Organ*Disease, data = factorial)
anova(fac_lm_2)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 4.2960 4.2960 4.9003 0.05775 .
## Disease 1 3.1827 3.1827 3.6304 0.09320 .
## Organ:Disease 1 2.5392 2.5392 2.8963 0.12719
## Residuals 8 7.0135 0.8767
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we consider a subset of the data above.
| Organ | Diabetic | Healthy |
|---|---|---|
| innerLV | n = 3 | n = 1 |
| outerLV | n = 2 | n = 2 |
| Disease | Organ | Animal | Sample | logAUC |
|---|---|---|---|---|
| Healthy | outerLV | 4 | 2 | 10.49 |
| Healthy | innerLV | 5 | 1 | 9.74 |
| Healthy | outerLV | 6 | 2 | 10.98 |
| Diabetic | innerLV | 7 | 1 | 7.92 |
| Diabetic | outerLV | 8 | 2 | 9.37 |
| Diabetic | innerLV | 9 | 1 | 8.69 |
| Diabetic | innerLV | 11 | 1 | 7.01 |
| Diabetic | outerLV | 12 | 2 | 9.29 |
Fitting models with interactions using lm()
Note: order matters. For example,
fac_lm <- lm(logAUC ~ Disease*Organ, data = unbalanced_nafree)
anova(fac_lm)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 7.1102 7.1102 18.4955 0.01264 *
## Organ 1 3.1149 3.1149 8.1027 0.04656 *
## Disease:Organ 1 0.0913 0.0913 0.2376 0.65145
## Residuals 4 1.5377 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fac_lm_2 <- lm(logAUC ~ Organ*Disease, data = unbalanced_nafree)
anova(fac_lm_2)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 5.7291 5.7291 14.9029 0.01814 *
## Disease 1 4.4960 4.4960 11.6953 0.02678 *
## Organ:Disease 1 0.0913 0.0913 0.2376 0.65145
## Residuals 4 1.5377 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The three global null hypotheses being tested are (essentially) the same
However, now the order of the terms affects the estimation. Look carefully at the anova() outputs above; what do you notice about the sums of squares (Sum Sq) values?
For a balanced two-factor experiment (e.g., in the previous section) partitioning of the sums of squares (\(SS\)) is additive:
\[SS_\text{Treatment} = SS_\text{Disease} + SS_\text{Organ} + SS_\text{Disease:Organ}.\]
The order in which the main effects are added to the model does not matter. However, for an unbalanced design this is not the case as the sums of squares are calculated sequentially. Note that sequentially calculated \(SS\) are known as Type I \(SS\).
As a term enters the model its \(SS\) is calculated, which is then subtracted from the total \(SS\). This then reduces the available \(SS\) for the next term entering the model. When treatment combinations in a factorial experiment are unequally replicated, their effects are not mutually independent, so that the order in which terms enter the model matters.
Considering the data above, if we include Disease as a main effect first (i.e., Disease*Organ) then \(SS_\text{Disease}\) is calculated first, ignoring the Organ main effect. Here, some Organ information is confounded with Disease information (i.e., variation due to Organ is confounded with variation due to Disease). Then \(SS_\text{Organ}\) is calculated having been adjusted for the Disease main effect; this now only contains Organ information (i.e., variation due to the Organ effect) since all the Disease information was eliminated in the previous step. Finally, \(SS_\text{Disease:Organ}\) is calculated adjusted for both \(SS_\text{Disease}\) and \(SS_\text{Organ}\); here, there is no information left relating to the Disease or Organ main effects.
What does this look like?
For Disease*Organ we calculate \(SS_\text{Disease}\) ignoring any Organ effect (if you are unsure what each line of code is doing I suggest you run it line by line to see what’s being done at each step).
unbalanced_nafree %>%
mutate(grand_mean = mean(logAUC)) %>%
group_by(Disease) %>%
summarise(n = n(),
treatment_mean = mean(logAUC),
grand_mean = mean(grand_mean)) %>%
mutate(ss_disease = n * (treatment_mean - grand_mean)^2) %>%
pull(ss_disease) %>%
sum()
## [1] 7.110201
This matches, as we’d expect, \(SS_\text{Disease}\) calculated from the Disease*Organ model above.
But what about the sequential \(SS\)? A simple way to think about this is in terms of sequential models:
## Null model.
fit.null <- lm(logAUC ~ 1, data = unbalanced_nafree)
## Only Factor Disease.
fit.a <- lm(logAUC ~ Disease, data = unbalanced_nafree)
## Factors Disease and Organ.
fit.ab <- lm(logAUC ~ Disease + Organ, data = unbalanced_nafree)
## Factors Disease and Organ with interaction.
fit.abi <- lm(logAUC ~ Disease*Organ, data = unbalanced_nafree)
## ANOVA table, as above
anova(fit.abi)
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 7.1102 7.1102 18.4955 0.01264 *
## Organ 1 3.1149 3.1149 8.1027 0.04656 *
## Disease:Organ 1 0.0913 0.0913 0.2376 0.65145
## Residuals 4 1.5377 0.3844
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## First line.
sum(residuals(fit.null)^2) - sum(residuals(fit.a)^2)
## [1] 7.110201
## Second line.
sum(residuals(fit.a)^2) - sum(residuals(fit.ab)^2)
## [1] 3.114926
## Third line.
sum(residuals(fit.ab)^2) - sum(residuals(fit.abi)^2)
## [1] 0.09134405
## Final line.
sum(residuals(fit.abi)^2)
## [1] 1.537717
Rather than calculating \(SS\) sequentially we can calculate the \(SS\) for a given effect adjusting for all other effects listed in the model. This means that the \(SS_\text{A}\) and \(SS_\text{B}\) main effects will both be adjusted for each other (since neither contains the other), but will not be adjusted for \(SS_\text{A:B}\) (since it contains both A and B). \(SS_\text{A:B}\) will be adjusted for both main effects.
In our example above, the \(SS_\text{Disease}\) and \(SS_\text{Organ}\) main effects will both be adjusted for each other, but will not be adjusted for \(SS_\text{Disease:Organ}\); \(SS_\text{Disease:Organ}\) will be adjusted for both main effects.
We can calculate the Type II \(SS\) table in R by either
## Type II Organ SS
anova(fac_lm)[2, ]
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 3.1149 3.1149 8.1027 0.04656 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Type II Disease SS
anova(fac_lm_2)[2, ]
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 4.496 4.496 11.695 0.02678 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Type II Organ:Disease/Disease:Organ
anova(fac_lm_2)[3, ]
## Analysis of Variance Table
##
## Response: logAUC
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ:Disease 1 0.091344 0.091344 0.2376 0.6514
or,
use the car::Anova() function (here it does not matter in which order the model terms are specified):
car::Anova(fac_lm, type = 2)
## Anova Table (Type II tests)
##
## Response: logAUC
## Sum Sq Df F value Pr(>F)
## Disease 4.4960 1 11.6953 0.02678 *
## Organ 3.1149 1 8.1027 0.04656 *
## Disease:Organ 0.0913 1 0.2376 0.65145
## Residuals 1.5377 4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: Type III \(SS\) and beyond are not covered in this course but I would recommend reading this recent paper for an intuitive overview.
Balanced design
| Disease | Organ | logAUC |
|---|---|---|
| Healthy | innerLV | 9.40 |
| Healthy | outerLV | 8.83 |
| Healthy | innerLV | 10.33 |
| Healthy | outerLV | 10.49 |
| Healthy | innerLV | 9.74 |
| Healthy | outerLV | 10.98 |
| Diabetic | innerLV | 7.92 |
| Diabetic | outerLV | 9.37 |
| Diabetic | innerLV | 8.69 |
| Diabetic | outerLV | 11.31 |
| Diabetic | innerLV | 7.01 |
| Diabetic | outerLV | 9.29 |
| Grand mean |
|---|
| 9.447 |
| Disease | Organ | log AUC mean |
|---|---|---|
| Diabetic | innerLV | 7.873 |
| Diabetic | outerLV | 9.990 |
| Healthy | innerLV | 9.823 |
| Healthy | outerLV | 10.100 |
| Organ | log AUC mean |
|---|---|
| innerLV | 8.848 |
| outerLV | 10.045 |
| Disease | log AUC mean |
|---|---|
| Diabetic | 8.932 |
| Healthy | 9.962 |
Unbalanced design (Unequal replication due to missing data)
| Disease | Organ | logAUC |
|---|---|---|
| Healthy | innerLV | NA |
| Healthy | outerLV | NA |
| Healthy | innerLV | NA |
| Healthy | outerLV | 10.49 |
| Healthy | innerLV | 9.74 |
| Healthy | outerLV | 10.98 |
| Diabetic | innerLV | 7.92 |
| Diabetic | outerLV | 9.37 |
| Diabetic | innerLV | 8.69 |
| Diabetic | outerLV | NA |
| Diabetic | innerLV | 7.01 |
| Diabetic | outerLV | 9.29 |
| Grand mean |
|---|
| 9.186 |
| Disease | Organ | log AUC mean |
|---|---|---|
| Diabetic | innerLV | 7.873 |
| Diabetic | outerLV | 9.330 |
| Healthy | innerLV | 9.740 |
| Healthy | outerLV | 10.735 |
Everything is as we’d expect up until now, but what about the marginal means? Perhaps the most obvious approach would be the following (i.e., ignore the subgroups, hence giving all observations equal weight):
unbalanced_nafree %>%
dplyr::select(c(Disease, logAUC)) %>%
group_by(Disease) %>%
summarise(Mean = mean(logAUC))
## # A tibble: 2 × 2
## Disease Mean
## <fct> <dbl>
## 1 Diabetic 8.46
## 2 Healthy 10.4
However, as a result of this the means are biased towards groups with greater replication! To avoid this we give the subgroups (Organ) cell means equal weight (this is sometimes called the least squares mean):
unbalanced_nafree %>%
dplyr::select(c(Disease, Organ, logAUC)) %>%
group_by(Organ, Disease) %>%
mutate(n = n()) %>% ## count observations in each group
## then calculate weighted mean based on the no. observations
summarise(weighted_mean = weighted.mean(logAUC, w = 1/n)) %>%
group_by(Disease) %>% ## calculate mean of weighted means
summarise(mean = mean(weighted_mean))
## # A tibble: 2 × 2
## Disease mean
## <fct> <dbl>
## 1 Diabetic 8.60
## 2 Healthy 10.2
Now see if you can do the same across Organ to get
| Organ | mean |
|---|---|
| innerLV | 8.807 |
| outerLV | 10.032 |
What does all of this tell us? When calculating marginal means for unbalanced designs we need to be careful! Luckily there are inbuilt functions in R that do this correctly for us! See the next section for an example using the predictmeans function from the predictmeans package.
Recall that each time we carry out a hypothesis test the probability we get a false positive result (type I error) is given by \(\alpha\) (the level of significance we choose).
When we have multiple comparisons to make we should then control the Type I error rate across the entire family of tests under consideration, i.e., control the Family-Wise Error Rate (FWER); this ensures that the risk of making at least one Type I error among the family of comparisons in the experiment is \(\alpha\).
| State of Nature | Don’t reject \(H_0\) | reject \(H_0\) |
|---|---|---|
| \(H_0\) is true | ✅ | Type I error |
| \(H_0\) is false | Type II error | ✅ |
The familywise error rate (FWER) is the risk of making at least one Type I error among the family of comparisons in the experiment. Now consider carrying out \(m\) independent t-tests and, for any single test, let Pr(commit a Type I error) \(= \alpha_c\) be the per-comparison error rate (PCER). So for a single test the probability that a correct decision is made is \(1 - \alpha_c\). Therefore, for \(m\) independent t-tests the probability of committing no Type I errors is \((1 - \alpha_c)^m\) and the probability of committing at least one Type I error is \(1 - (1 - \alpha_c)^m = \alpha_F\), which is the upper limit of the FWER.
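Even a modest number of tests inflates this upper limit quickly; for example, with \(\alpha_c = 0.05\):

```r
alpha_c <- 0.05
m <- c(1, 3, 5, 10, 20)          ## number of independent tests
fwer <- 1 - (1 - alpha_c)^m      ## upper limit of the FWER
round(data.frame(m = m, FWER = fwer), 3)
```

With 10 independent tests at the 5% level, the chance of at least one false positive is already about 40%.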
The False Discovery Rate (FDR) controls the expected (mean) proportion of false discoveries among the \(R\) (out of \(m\)) hypotheses declared significant.
Consider testing \(m\) null hypotheses with corresponding p-values \(P_1, P_2, \ldots, P_m\); we then order them so that \(P_{(1)} \leq P_{(2)} \leq \ldots \leq P_{(m)}\) (where \(P_{(i)}\) is the \(i^{th}\) smallest, \(i = 1, \ldots, m\)). The \(i^{th}\) ordered p-value is compared with \(\frac{i}{m}q^*\), and we reject the null hypotheses corresponding to \(P_{(1)}, \ldots, P_{(k)}\), where \(k\) is the largest \(i\) for which \(P_{(i)} \leq \frac{i}{m}q^*\).
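This procedure can be sketched directly, and cross-checked against R’s p.adjust(); the p-values here are hypothetical, chosen only for illustration:

```r
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)  ## hypothetical p-values
m <- length(p); q_star <- 0.05
p_sorted <- sort(p)
## largest i with P_(i) <= (i/m) q*; reject hypotheses 1..k in the ordering
k <- max(c(0, which(p_sorted <= (1:m) / m * q_star)))
k
## equivalently: reject wherever the BH-adjusted p-value is <= q*
sum(p.adjust(p, method = "BH") <= q_star)
```

Both approaches flag the same number of discoveries.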
Calculating 95% confidence intervals for pairwise comparisons is similar to the procedure discussed in module 2:
\[\text{95% CI} = \text{estimate} \pm (\text{scale factor} \times \text{standard error}_\text{of estimate}).\] For the 95% CI for pairwise comparisons of means this becomes
\[\text{95% CI} = \text{difference} \pm (\text{scale factor} \times \text{SED}),\] where SED is the standard error of the difference (which depends on group replication). The choice of scale factor depends on what adjustments we might want to make. Below we cover three common adjustments. Specifically where (scale factor × SED) is the 1) Fisher correction LSD, 2) Bonferroni correction LSD, and 3) Tukey’s HSD.
To calculate the LSD the \(t\)-distribution is used, specifically the
\[t_{\alpha = \frac{\alpha_c}{2}, \text{df} = N - m}\] distribution, where \(N\) is the number of observations and \(m\) is the number of treatment groups. To obtain the critical value for \(\alpha_c = 0.05\), \(N = 12\) and \(m = 3\) the following code can be used in R.
qt(p = 0.05/2,df = 12 - 3, lower.tail = FALSE)
## [1] 2.262157
Fisher’s LSD is then \[t_{\alpha = \frac{\alpha_c}{2}, \text{df} = N - m} \times \text{SED}.\]
Here we carry out post-hoc tests only if the ANOVA F-test is significant. If it is, we declare significant (at the \(100\alpha\%\) level) any pairwise difference greater than the LSD. Note that this procedure does not control the FWER.
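Putting this together for the rats data (residual mean square 0.6732 from the ANOVA table above, with \(n = 4\) rats per group):

```r
sed <- sqrt(2 * 0.6732 / 4)    ## SED from the residual mean square, n = 4 per group
t_crit <- qt(p = 0.05 / 2, df = 12 - 3, lower.tail = FALSE)
lsd <- t_crit * sed
lsd   ## any pairwise difference larger than this is declared significant
```

The value (roughly 1.31) matches the LSD reported by predictmeans in a later section.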
To calculate Bonferroni correction the \(t\)-distribution is used, specifically the \[t_{\alpha = \frac{\alpha_c}{2 \times k}, \text{df} = N - m}\] distribution, where \(N\) is the number of observations, \(m\) is the number of treatment groups, and \(k = {m \choose 2}\) is the number of pairwise comparisons being made.
To obtain the critical value for \(\alpha_c = 0.05\), \(N = 12\) and \(m = 3\) the following code can be used in R.
qt(p = 0.05/(2 * choose(3,2)),df = 12 - 3, lower.tail = FALSE)
## [1] 2.933324
Bonferroni’s LSD is then \[t_{\alpha = \frac{\alpha_c}{2 \times k}, \text{df} = N - m} \times \text{SED}.\]
Here we reject any \(H_0\) for which the p-value, p-val, satisfies p-val \(< \alpha_c = \frac{\alpha_f}{n_c}\), where \(\alpha_f\) is the FWER and \(n_c\) is the number of pairwise comparisons. Note that this correction makes no assumption of independence between the tests.
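The corresponding Bonferroni LSD for the rats data is larger than Fisher’s, reflecting the stricter criterion:

```r
k <- choose(3, 2)              ## number of pairwise comparisons
sed <- sqrt(2 * 0.6732 / 4)    ## SED, as before
t_crit <- qt(p = 0.05 / (2 * k), df = 12 - 3, lower.tail = FALSE)
t_crit * sed                   ## the Bonferroni LSD
```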
This compares the mean of every treatment with the mean of every other treatment and uses a studentised range distribution (compared with a \(t\)-distribution for Fisher’s LSD and the Bonferroni correction). Therefore, to calculate the HSD the studentised range distribution is used, specifically the \[q_{1 - \alpha_c, m, \text{df} = N - m}\] distribution, where \(N\) is the number of observations, \(m\) is the number of treatment groups, and \(q_{\alpha, m, \text{df}}\) is the \(\alpha\) quantile of the studentised range distribution.
To obtain the critical value for \(\alpha_c = 0.05\), \(N = 12\) and \(m = 3\) the following code can be used in R.
qtukey(p = 1 - 0.05, nmeans = 3, df = 12 - 3)
## [1] 3.948492
Tukey’s HSD is given by \[\frac{q_{ 1 - \alpha_c, m, \text{df} = N - m}}{\sqrt{2}} \times \sqrt{\frac{2\hat{\sigma}^2}{n}}.\] Here \(\hat{\sigma}^2\) is the residual mean square error and \(n\) is the assumed equal number of replicates in each group.
Suppose we have a number \(m\) of null hypotheses, \(H_1, H_2, \ldots, H_m\). Using the traditional parlance, we reject a null hypothesis if the test is declared significant and do not reject it if the test is “not significant”. Now, summing each type of outcome over all \(H_i\) (\(i = 1, \ldots, m\)) yields the following random variables:
| | Null hypothesis is true (\(H_0\)) | Alternative hypothesis is true (\(H_A\)) | Total |
|---|---|---|---|
| Test is declared significant | V | S | R |
| Test is declared non-significant | U | T | m - R |
| Total | \(m_{0}\) | \(m - m_0\) | m |
Recall the CRD rats data
| Surgery | Rat | logAUC |
|---|---|---|
| C | 1 | 8.49 |
| C | 2 | 8.20 |
| C | 3 | 9.08 |
| C | 4 | 8.07 |
| P | 1 | 10.24 |
| P | 2 | 7.72 |
| P | 3 | 9.34 |
| P | 4 | 8.50 |
| S | 1 | 11.31 |
| S | 2 | 12.69 |
| S | 3 | 11.37 |
| S | 4 | 10.82 |
You can read this data directly into R using
rats <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/crd_rats_data.csv")
To use predictmeans later on we have to ensure that the relevant variables are coded as factors:
rats <- rats %>%
mutate(Surgery = as.factor(Surgery))
We can fit a linear model using lm():
rats_lm <- lm(logAUC ~ Surgery, data = rats)
coef(rats_lm)
## (Intercept) SurgeryP SurgeryS
## 8.4600 0.4900 3.0875
Our fitted model is therefore
\[ \begin{aligned} \operatorname{\widehat{logAUC}} &= 8.46 + 0.49(\operatorname{Surgery}_{\operatorname{P}}) + 3.09(\operatorname{Surgery}_{\operatorname{S}}) \end{aligned} \]
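Adding the relevant coefficients gives the three group means, which should match the predicted means reported below:

```r
## coefficients from coef(rats_lm) above
b <- c(intercept = 8.46, SurgeryP = 0.49, SurgeryS = 3.0875)
## baseline group C is the intercept; the others add their offsets
c(C = b[["intercept"]],
  P = b[["intercept"]] + b[["SurgeryP"]],
  S = b[["intercept"]] + b[["SurgeryS"]])
```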
Using predictmeans
By default Fisher’s comparisons are made.
pm <- predictmeans::predictmeans(rats_lm , modelterm = "Surgery",
pairwise = TRUE, plot = FALSE)
str(pm)
## List of 10
## $ Predicted Means : 'xtabs' num [1:3(1d)] 8.46 8.95 11.55
## ..- attr(*, "dimnames")=List of 1
## .. ..$ Surgery: chr [1:3] "C" "P" "S"
## ..- attr(*, "call")= language xtabs(formula = pm ~ ., data = mt[, c("pm", vars)], drop.unused.levels = TRUE)
## $ Standard Error of Means : Named num 0.41
## ..- attr(*, "names")= chr "All means have the same SE"
## $ Standard Error of Differences: Named num [1:3] 0.58 0.58 0.58
## ..- attr(*, "names")= chr [1:3] "Max.SED" "Min.SED" "Aveg.SED"
## $ LSD : Named num [1:3] 1.31 1.31 1.31
## ..- attr(*, "names")= chr [1:3] "Max.LSD" "Min.LSD" "Aveg.LSD"
## ..- attr(*, "Significant level")= num 0.05
## ..- attr(*, "Degree of freedom")= num 9
## $ Pairwise LSDs : num [1:3, 1:3] 0 1.31 1.31 -0.49 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "C" "P" "S"
## .. ..$ : chr [1:3] "C" "P" "S"
## ..- attr(*, "Significant level")= num 0.05
## ..- attr(*, "Degree of freedom")= int 9
## ..- attr(*, "Note")= chr "LSDs matrix has mean differences (row-col) above the diagonal, LSDs (adjusted by 'none' method) below the diagonal"
## $ Pairwise p-value : num [1:3, 1:3] 0 0.4202 0.0005 -0.8446 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "C" "P" "S"
## .. ..$ : chr [1:3] "C" "P" "S"
## ..- attr(*, "Degree of freedom")= int 9
## ..- attr(*, "Note")= chr "The matrix has t-value above the diagonal, p-value (adjusted by 'none' method) below the diagonal"
## $ predictmeansPlot :List of 2
## ..$ : NULL
## ..$ : NULL
## $ predictmeansBarPlot : NULL
## $ mean_table :List of 2
## ..$ Table:'data.frame': 3 obs. of 7 variables:
## .. ..$ Surgery : Factor w/ 3 levels "C","P","S": 1 2 3
## .. ..$ Mean : num [1:3] 8.46 8.95 11.55
## .. ..$ SE : num [1:3] 0.41 0.41 0.41
## .. ..$ Df : int [1:3] 9 9 9
## .. ..$ LL(95%) : num [1:3] 7.53 8.02 10.62
## .. ..$ UL(95%) : num [1:3] 9.39 9.88 12.48
## .. ..$ LetterGrp: chr [1:3] "A " "A " " B"
## ..$ Note : chr "Letter-based representation of pairwise comparisons at significant level '0.05'"
## $ p_valueMatrix : num [1:3, 1:3] 0 0.4202 0.0005 0 0 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "C" "P" "S"
## .. ..$ : chr [1:3] "C" "P" "S"
## - attr(*, "class")= chr "pdmlist"
Let’s look at some of the pm elements more closely:
pm$mean_table
## $Table
## Surgery Mean SE Df LL(95%) UL(95%) LetterGrp
## 1 C 8.4600 0.41025 9 7.5319 9.3881 A
## 2 P 8.9500 0.41025 9 8.0219 9.8781 A
## 3 S 11.5475 0.41025 9 10.6194 12.4756 B
##
## $Note
## [1] "Letter-based representation of pairwise comparisons at significant level '0.05'"
This gives us the estimated group means and associated 95% CIs. Why are the standard errors all equal?
But, what we’d like is the pairwise comparisons between groups. Information pertaining to this is also returned:
print(pm$`Pairwise p-value`)
## C P S
## C 0.0000 -0.8446 -5.3216
## P 0.4202 0.0000 -4.4770
## S 0.0005 0.0015 0.0000
## attr(,"Degree of freedom")
## [1] 9
## attr(,"Note")
## [1] "The matrix has t-value above the diagonal, p-value (adjusted by 'none' method) below the diagonal"
This gives us the pairwise comparison statistic (the \(t\)-statistic in this case) on the upper diagonal and the associated p-value’s on the lower diagonal.
What about the 95%CI for the comparisons? Luckily predictmeans also returns the pairwise LSD values (Fisher’s by default with \(\alpha_c = 0.05\)):
pm$`Pairwise LSDs`
## C P S
## C 0.00000 -0.49000 -3.0875
## P 1.31247 0.00000 -2.5975
## S 1.31247 1.31247 0.0000
## attr(,"Significant level")
## [1] 0.05
## attr(,"Degree of freedom")
## [1] 9
## attr(,"Note")
## [1] "LSDs matrix has mean differences (row-col) above the diagonal, LSDs (adjusted by 'none' method) below the diagonal"
Here, the upper diagonal matrix has the pairwise differences and the lower has the LSD values.
So, we have all the information to construct the Fisher LSD 95% CIs! You could extract all the information and construct a pairwise comparison table manually, or use a pre-written function (aren’t I nice to provide one!):
url <- "https://gist.github.com/cmjt/72f3941533a6bdad0701928cc2924b90"
devtools::source_gist(url, quiet = TRUE)
comparisons(pm)
## Comparison Difference SED LSD lwr upr t p
## 1 C-P -0.490 0.58 1.312 -1.802 0.822 -0.845 0.4202
## 2 C-S -3.087 0.58 1.312 -4.400 -1.775 -5.322 0.0005
## 3 P-S -2.598 0.58 1.312 -3.910 -1.285 -4.477 0.0015
To use the Bonferroni correction we must now calculate and specify the adjusted significance level \(\alpha_b = \frac{\alpha_f}{n_c}\); in our case this is \(\frac{0.05}{{3 \choose 2}}\). We also specify adj = "bonferroni":
alpha.adj <- 0.05/choose(3,2)
bonferroni <- predictmeans::predictmeans(rats_lm ,
modelterm = "Surgery", adj = "bonferroni",
level = alpha.adj,
pairwise = TRUE, plot = FALSE)
comparisons(bonferroni)
Things are a bit more cumbersome when it comes to Tukey’s HSD. We can specify adj = "tukey" in predictmeans:
tukey <- predictmeans::predictmeans(rats_lm ,
modelterm = "Surgery", adj = "tukey",
level = alpha.adj,
pairwise = TRUE, plot = FALSE)
names(tukey)
However, the HSD values are not returned, so we cannot calculate our pairwise CIs. R has an inbuilt function, TukeyHSD(), that computes the pairwise CIs for us, but as we can see below the HSD values themselves are still not returned.
TukeyHSD(aov(logAUC~Surgery, data = rats))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = logAUC ~ Surgery, data = rats)
##
## $Surgery
## diff lwr upr p adj
## P-C 0.4900 -1.1298813 2.109881 0.6863267
## S-C 3.0875 1.4676187 4.707381 0.0012479
## S-P 2.5975 0.9776187 4.217381 0.0039400
From above we saw that Tukey’s HSD was given by
\[\frac{q_{1-\alpha_c, m, \text{df} = N - m}}{\sqrt{2}} \times \sqrt{\frac{2\hat{\sigma}^2}{n}}.\]
We can calculate \(q_{1 - \alpha_c, m, \text{df} = N - m}\) for our data using
qtukey(p = 1- 0.05, nmeans = 3, df = 12 - 3)
## [1] 3.948492
Assuming equal replicates (\(n = 4\), with \(\hat{\sigma}^2\) as above) we calculate the SED as
sqrt(2 * anova(rats_lm)[2,3] / 4)
## [1] 0.5801856
Therefore, Tukey’s HSD is \(\frac{3.9484922}{\sqrt{2}}\times 0.5801856 = 1.6198813.\)
To calculate the CIs we use \(\text{difference} \pm \text{HSD}\):
HSD <- (qtukey(p = 1 - 0.05, nmeans = 3, df = 12 - 3)/sqrt(2))*sqrt(2 * anova(rats_lm)[2,3] / 4)
all_diffs <- outer(tukey$`Predicted Means`, tukey$`Predicted Means`, "-")
comparison_table <- data.frame(differences = all_diffs[lower.tri(all_diffs)]) %>%
mutate(upper = differences + HSD, lower = differences - HSD, HSD = HSD)
comparison_table
This matches the output from TukeyHSD() above!
What about the values returned by predictmeans()?
print(tukey$`Pairwise p-value`)
There is an easier way to see the same pairwise contrasts using the emmeans function from the package of the same name and the pairs function:
em <- emmeans::emmeans(rats_lm, specs = "Surgery")
pairs(em, adjust = "tukey")
## contrast estimate SE df t.ratio p.value
## C - P -0.49 0.58 9 -0.845 0.6863
## C - S -3.09 0.58 9 -5.322 0.0012
## P - S -2.60 0.58 9 -4.477 0.0039
##
## P value adjustment: tukey method for comparing a family of 3 estimates
This gives us the same information as predictmeans but in an easier to read format (still no CIs though; we still have to calculate those ourselves!)
However, the emmeans package facilitates some nice plotting of the pairwise comparisons:
plot(em, comparisons = TRUE) + theme_bw()
Recall, blocking helps control variability by making treatment groups more alike. Experimental units are divided into subsets (called blocks) so that units within the same block are more similar than units from different subsets or blocks. Blocking is a technique for dealing with nuisance factors. A nuisance factor is a factor that has some effect on the response, but is of no interest (e.g., age class).
Fixed effects are terms (parameters) in a statistical model which are fixed, or non-random, quantities (e.g., treatment group’s mean response). For the same treatment, we expect this quantity to be the same from experiment to experiment.
Random effects are terms (parameters) in a statistical model which are considered as random quantities or variables (e.g., block id). Specifically, terms whose levels are a representative sample from a population, and where the variance of the population is of interest should be allocated as random. Setting a block as a random effect allows us to infer variation between blocks as well as the variation between experimental units within blocks.
Key idea: Partition known sources of variation which are unimportant to key scientific question(s) to improve precision of comparisons between treatment means.
| Run | Surgery | Rat | logAUC8 |
|---|---|---|---|
| 1 | C | 1 | 9.24 |
| 1 | P | 2 | 8.81 |
| 1 | S | 3 | 10.75 |
| 2 | C | 4 | 3.89 |
| 2 | P | 5 | 8.62 |
| 2 | S | 6 | 10.24 |
| 3 | C | 7 | 8.42 |
| 3 | P | 8 | 9.93 |
| 3 | S | 9 | 11.68 |
| 4 | C | 10 | 8.77 |
| 4 | P | 11 | 10.86 |
| 4 | S | 12 | 13.05 |
You can read these data directly into R:
rcbd <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/rcbd.csv")
To use predictmeans later on we have to ensure that the relevant variables are coded as factors:
rcbd <- rcbd %>%
mutate(Run = as.factor(Run)) %>%
mutate(Surgery = as.factor(Surgery))
Run as a fixed effect
lm <- lm(logAUC8 ~ Run + Surgery, data = rcbd)
summary(lm)
##
## Call:
## lm(formula = logAUC8 ~ Run + Surgery, data = rcbd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7517 -0.3683 -0.0900 0.4508 1.5817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6583 0.8506 9.004 0.000105 ***
## Run2 -2.0167 0.9822 -2.053 0.085854 .
## Run3 0.4100 0.9822 0.417 0.690882
## Run4 1.2933 0.9822 1.317 0.235963
## SurgeryP 1.9750 0.8506 2.322 0.059293 .
## SurgeryS 3.8500 0.8506 4.526 0.003991 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.203 on 6 degrees of freedom
## Multiple R-squared: 0.8449, Adjusted R-squared: 0.7157
## F-statistic: 6.538 on 5 and 6 DF, p-value: 0.02034
Run as a random effect
There are, confusingly, two ways of fitting the same model. For inference we require both!
Option 1 uses the lmer function from the lme4 package:
lmer4_mod <- lme4::lmer(logAUC8 ~ Surgery + (1|Run), data = rcbd)
summary(lmer4_mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: logAUC8 ~ Surgery + (1 | Run)
## Data: rcbd
##
## REML criterion at convergence: 37.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8525 -0.2273 0.1772 0.4036 1.3309
##
## Random effects:
## Groups Name Variance Std.Dev.
## Run (Intercept) 1.479 1.216
## Residual 1.447 1.203
## Number of obs: 12, groups: Run, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 7.5800 0.8552 8.863
## SurgeryP 1.9750 0.8506 2.322
## SurgeryS 3.8500 0.8506 4.526
##
## Correlation of Fixed Effects:
## (Intr) SrgryP
## SurgeryP -0.497
## SurgeryS -0.497 0.500
Option 2 uses the lmer function from the lmerTest package:
lmerTest_mod <- lmerTest::lmer(logAUC8 ~ Surgery + (1|Run), data = rcbd)
summary(lmerTest_mod)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: logAUC8 ~ Surgery + (1 | Run)
## Data: rcbd
##
## REML criterion at convergence: 37.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8525 -0.2273 0.1772 0.4036 1.3309
##
## Random effects:
## Groups Name Variance Std.Dev.
## Run (Intercept) 1.479 1.216
## Residual 1.447 1.203
## Number of obs: 12, groups: Run, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 7.5800 0.8552 5.9567 8.863 0.000119 ***
## SurgeryP 1.9750 0.8506 6.0000 2.322 0.059293 .
## SurgeryS 3.8500 0.8506 6.0000 4.526 0.003991 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) SrgryP
## SurgeryP -0.497
## SurgeryS -0.497 0.500
As you can see they give the same output! Why bother, you might ask?! This will become apparent later on.
Inference about the random effects
We have two variance components: the between-run (block) variance (1.479) and the residual, within-run, variance (1.447).
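One common summary of these two components (not computed in the output above) is the intraclass correlation: the proportion of the total variance attributable to runs. A minimal sketch, using the variance estimates reported in the lmer summary:

```r
# Variance components from the lmer summary above
var_run <- 1.479    # between-run (block) variance
var_resid <- 1.447  # within-run (residual) variance

# Intraclass correlation: proportion of total variance due to runs
icc <- var_run / (var_run + var_resid)
round(icc, 3)  # 0.505
```

Roughly half the unexplained variation is attributable to differences between runs, which is exactly why blocking on Run pays off here.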
Note that aov() presents the same information, but in a different way:
summary(aov(logAUC8 ~ Surgery + Error(Run), data = rcbd))
##
## Error: Run
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 3 17.65 5.883
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Surgery 2 29.652 14.826 10.25 0.0116 *
## Residuals 6 8.682 1.447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Inference about the fixed effects
Specifying Run as a random effect changes our estimated baseline (i.e., the Intercept coefficient), as now an effect due to Run is attributed to the structural component of the model.
We can interpret the fixed effects of a LMM as we might for a linear model (now the Intercept estimate changes depending on Run).
coefficients(lmer4_mod)
## $Run
## (Intercept) SurgeryP SurgeryS
## 1 7.639067 1.975 3.85
## 2 6.118411 1.975 3.85
## 3 7.948225 1.975 3.85
## 4 8.614297 1.975 3.85
##
## attr(,"class")
## [1] "coef.mer"
coefficients(lmerTest_mod)
## $Run
## (Intercept) SurgeryP SurgeryS
## 1 7.639067 1.975 3.85
## 2 6.118411 1.975 3.85
## 3 7.948225 1.975 3.85
## 4 8.614297 1.975 3.85
##
## attr(,"class")
## [1] "coef.mer"
What about an ANOVA table? NOTE that specifying the type of \(SS\) (e.g., Type I, II, or III) in an anova call only works with a model fitted using lmerTest::lmer:
anova(lmerTest_mod, type = 1)
## Type I Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Surgery 29.652 14.826 2 6 10.246 0.01162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lmerTest_mod, type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Surgery 29.652 14.826 2 6 10.246 0.01162 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now, as we only have a single fixed effect in our model (Surgery) the ANOVA Type I and Type II tables above are equivalent!
Inference using predictmeans() (Note: output will be the same for the lmerTest model.)
library(predictmeans)
pm <- predictmeans(lmer4_mod, modelterm = "Surgery", pairwise = TRUE, plot = FALSE)
Using the elements of the predictmeans object (as in the last section) we can extract the pairwise comparison information:
pm$`Pairwise LSDs`
## C P S
## C 0.00000 -1.97500 -3.850
## P 2.08499 0.00000 -1.875
## S 2.08499 2.08499 0.000
## attr(,"Significant level")
## [1] 0.05
## attr(,"Degree of freedom")
## [1] 5.95671
## attr(,"Note")
## [1] "LSDs matrix has mean differences (row-col) above the diagonal, LSDs (adjusted by 'none' method) below the diagonal"
print(pm$`Pairwise p-value`)
## C P S
## C 0.0000 -2.3219 -4.5263
## P 0.0596 0.0000 -2.2043
## S 0.0041 0.0700 0.0000
## attr(,"Degree of freedom")
## [1] 5.95671
## attr(,"Note")
## [1] "The matrix has t-value above the diagonal, p-value (adjusted by 'none' method) below the diagonal"
This gives us 1) the pairwise differences (upper diagonal) and the LSD values (lower diagonal), and 2) the pairwise comparison statistics (the \(t\)-statistics in this case) on the upper diagonal and the associated p-values on the lower diagonal.
So is there a pairwise difference in means? Let’s organise the information in a table.
## Comparison Difference SED LSD lwr upr t p
## 1 C-P -1.975 0.851 2.085 -4.060 0.110 -2.322 0.0596
## 2 C-S -3.850 0.851 2.085 -5.935 -1.765 -4.526 0.0041
## 3 P-S -1.875 0.851 2.085 -3.960 0.210 -2.204 0.0700
Have a look at the CIs and p-values. What do you conclude?
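The LSD column in the table above can be reconstructed by hand as \(t_{0.975, \text{df}} \times \text{SED}\). A quick sketch, using the SED (0.8506) and the Satterthwaite degrees of freedom (5.95671) reported earlier:

```r
# LSD = t-quantile * standard error of a difference
sed <- 0.8506   # SED from the lmer fixed-effects output above
df <- 5.95671   # Satterthwaite degrees of freedom from predictmeans
lsd <- qt(0.975, df) * sed
round(lsd, 3)   # close to the 2.085 reported by predictmeans
```

Any pairwise difference larger in magnitude than this LSD is significant at the 5% level (equivalently, its CI excludes zero).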
We could also plot the pairwise comparisons:
| Disease | Organ | Animal | Sample | logAUC |
|---|---|---|---|---|
| Healthy | innerLV | 1 | 1 | 9.40 |
| Healthy | outerLV | 1 | 2 | 8.83 |
| Healthy | innerLV | 2 | 1 | 10.33 |
| Healthy | outerLV | 2 | 2 | 10.49 |
| Healthy | innerLV | 3 | 1 | 9.74 |
| Healthy | outerLV | 3 | 2 | 10.98 |
| Diabetic | innerLV | 4 | 1 | 7.92 |
| Diabetic | outerLV | 4 | 2 | 9.37 |
| Diabetic | innerLV | 5 | 1 | 8.69 |
| Diabetic | outerLV | 5 | 2 | 11.31 |
| Diabetic | innerLV | 6 | 1 | 7.01 |
| Diabetic | outerLV | 6 | 2 | 9.29 |
You can read these data directly into R:
split_plot <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/split_plot.csv")
To use predictmeans later on we have to ensure that the relevant variables are coded as factors:
split_plot <- split_plot %>%
mutate(Animal = as.factor(Animal)) %>%
mutate(Disease = as.factor(Disease))%>%
mutate(Organ = as.factor(Organ))
Animal as a random effect
Using aov()
sp_aov <- aov(logAUC ~ Disease*Organ + Error(Animal), data = split_plot)
summary(sp_aov)
##
## Error: Animal
## Df Sum Sq Mean Sq F value Pr(>F)
## Disease 1 3.183 3.183 2.187 0.213
## Residuals 4 5.822 1.456
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Organ 1 4.296 4.296 14.423 0.0191 *
## Disease:Organ 1 2.539 2.539 8.525 0.0433 *
## Residuals 4 1.191 0.298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using lmer() (from lmerTest and lme4)
Recall that specifying the type of \(SS\) (e.g., Type I, II, or III) in an anova call only works with a model fitted using lmerTest::lmer. Since we have now specified an interaction model, the type of \(SS\) calculated will affect inference for unbalanced designs.
sp_lmer <- lmerTest::lmer(logAUC ~ Disease*Organ + (1|Animal),
data = split_plot)
anova(sp_lmer,type = 1)
## Type I Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Disease 0.6513 0.6513 1 4 2.1866 0.21329
## Organ 4.2960 4.2960 1 4 14.4227 0.01914 *
## Disease:Organ 2.5392 2.5392 1 4 8.5246 0.04326 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(sp_lmer,type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Disease 0.6513 0.6513 1 4 2.1866 0.21329
## Organ 4.2960 4.2960 1 4 14.4227 0.01914 *
## Disease:Organ 2.5392 2.5392 1 4 8.5246 0.04326 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note here, though, that our design is balanced, hence the ordering of terms in our model does not make a difference (no need to specify Type II \(SS\)).
Pairwise comparison of time means
When a design has blocking, to get summary stats using predictmeans you should fit the model using lme4::lmer():
Calling predictmeans alone produces plots:
lmer <- lme4::lmer(logAUC ~ Disease*Organ + (1|Animal),
data = split_plot)
predmeans <- predictmeans::predictmeans(model = lmer ,modelterm = "Disease:Organ",
pairwise = TRUE)
predmeans$predictmeansPlot
## [[1]]
##
## [[2]]
Recall from previous sections that the LSD value is essentially the buffer around the point estimate (the radius of the CI, if you like), beyond the limit of which we might believe there to be a significant difference (a bit lax with terminology there!).
How would we interpret the above plot? I would conclude that there is a clear difference between Diabetic and Healthy for the innerLV wall, but no difference for the outerLV wall. We can construct the pairwise comparison table:
comparisons(predmeans)
## Comparison Difference SED LSD lwr upr t
## 1 Diabetic:innerLV-Diabetic:outerLV -2.117 0.446 1.111 -3.228 -1.006 -4.750
## 2 Diabetic:innerLV-Healthy:innerLV -1.950 0.764 1.906 -3.856 -0.044 -2.551
## 3 Diabetic:innerLV-Healthy:outerLV -2.227 0.764 1.906 -4.133 -0.321 -2.913
## 4 Diabetic:outerLV-Healthy:innerLV 0.167 0.765 1.906 -1.739 2.073 0.218
## 5 Diabetic:outerLV-Healthy:outerLV -0.110 0.764 1.906 -2.016 1.796 -0.144
## 6 Healthy:innerLV-Healthy:outerLV -0.277 0.446 1.111 -1.388 0.834 -0.621
## p
## 1 0.0038
## 2 0.0464
## 3 0.0293
## 4 0.8352
## 5 0.8907
## 6 0.5592
What do the CI coverages indicate?
We can also use emmeans to plot the pairwise comparisons
em <- emmeans::emmeans(lmer, ~Disease*Organ)
plot(em, pairwise = TRUE) + theme_bw()
From all the output above, what do you conclude about the interaction effect, if any?
Here we will use time points as factors. To read these data directly into R use
liver <- readr::read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/repeated_measures_liver.csv")
liver <- liver %>%
mutate(Time = as.factor(Time)) %>%
mutate(Treatment = as.factor(Treatment))
liver
## # A tibble: 210 × 4
## Animal Treatment Time Glucose
## <chr> <fct> <fct> <dbl>
## 1 Control1 Control 0 1.60
## 2 Control1 Control 5 1.28
## 3 Control1 Control 10 1.60
## 4 Control1 Control 15 1.58
## 5 Control1 Control 20 1.28
## 6 Control1 Control 25 1.44
## 7 Control1 Control 30 1.18
## 8 Control2 Control 0 0.84
## 9 Control2 Control 5 0.64
## 10 Control2 Control 10 0.7
## # ℹ 200 more rows
Below these data are plotted.
Animal as a random effect
Using aov()
re_aov <- aov(Glucose ~ Treatment*Time + Error(Animal),data = liver)
summary(re_aov)
##
## Error: Animal
## Df Sum Sq Mean Sq F value Pr(>F)
## Treatment 3 1.90 0.6335 1.407 0.263
## Residuals 26 11.71 0.4503
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Time 6 0.7973 0.13289 8.732 0.0000000334 ***
## Treatment:Time 18 0.2539 0.01411 0.927 0.547
## Residuals 156 2.3741 0.01522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using lmer() (from lmerTest and lme4)
Recall that specifying the type of \(SS\) (e.g., Type I, II, or III) in an anova call only works with a model fitted using lmerTest::lmer. Since we have now specified an interaction model, the type of \(SS\) calculated will affect inference for unbalanced designs.
re_lmer <- lmerTest::lmer(Glucose ~ Treatment*Time + (1|Animal), data = liver)
anova(re_lmer,type = 2)
## Type II Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## Treatment 0.06423 0.021409 3 26 1.4068 0.2632
## Time 0.79731 0.132885 6 156 8.7318 0.00000003345 ***
## Treatment:Time 0.25390 0.014105 18 156 0.9269 0.5474
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pairwise comparison of time means
As above, when a design has blocking, to get summary stats using predictmeans you should fit the model using lme4::lmer():
re_lmer4 <- lme4::lmer(Glucose ~ Treatment*Time + (1|Animal),data = liver)
predmeans <- predictmeans::predictmeans(model = re_lmer4 ,modelterm = "Time",
pairwise = TRUE, plot = TRUE)
How would we interpret the above plot?
What about the pairwise comparison table?
comparisons(predmeans)
## Comparison Difference SED LSD lwr upr t p
## 1 0-5 0.058 0.032 0.065 -0.006 0.123 1.830 0.0752
## 2 0-10 0.080 0.032 0.065 0.016 0.145 2.512 0.0164
## 3 0-15 0.111 0.032 0.065 0.046 0.176 3.473 0.0013
## 4 0-20 0.149 0.032 0.065 0.085 0.214 4.675 0.0000
## 5 0-25 0.145 0.032 0.065 0.080 0.209 4.531 0.0001
## 6 0-30 0.205 0.032 0.065 0.140 0.270 6.421 0.0000
## 7 5-10 0.022 0.032 0.065 -0.043 0.086 0.682 0.4993
## 8 5-15 0.052 0.032 0.065 -0.012 0.117 1.643 0.1089
## 9 5-20 0.091 0.032 0.065 0.026 0.155 2.845 0.0072
## 10 5-25 0.086 0.032 0.065 0.022 0.151 2.701 0.0103
## 11 5-30 0.147 0.032 0.065 0.082 0.211 4.591 0.0000
## 12 10-15 0.031 0.032 0.065 -0.034 0.095 0.960 0.3431
## 13 10-20 0.069 0.032 0.065 0.004 0.134 2.163 0.0370
## 14 10-25 0.064 0.032 0.065 0.000 0.129 2.018 0.0508
## 15 10-30 0.125 0.032 0.065 0.060 0.189 3.909 0.0004
## 16 15-20 0.038 0.032 0.065 -0.026 0.103 1.202 0.2367
## 17 15-25 0.034 0.032 0.065 -0.031 0.098 1.058 0.2968
## 18 15-30 0.094 0.032 0.065 0.029 0.159 2.948 0.0055
## 19 20-25 -0.005 0.032 0.065 -0.069 0.060 -0.144 0.8859
## 20 20-30 0.056 0.032 0.065 -0.009 0.120 1.746 0.0891
## 21 25-30 0.060 0.032 0.065 -0.004 0.125 1.890 0.0665
What do the CI coverages indicate?
We can also use emmeans to plot the pairwise comparisons
em <- emmeans::emmeans(re_lmer4, ~Treatment*Time)
plot(em, pairwise = TRUE) + theme_bw() + facet_wrap(~Treatment)
Learning objectives
“Ecology has increasingly become a data- and model-centric discipline…”
Recall, the simple linear regression model from module 2:
\[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where
\[\epsilon_i \sim \text{Normal}(0,\sigma^2).\]
Here, for observation \(i\), \(Y_i\) is the response, \(x_i\) is the value of the explanatory variable, and \(\epsilon_i\) is the random error term.
We also saw a different specification of this model in module 3:
There is an alternative, equivalent way of specifying the linear regression model which attributes the randomness directly to the response variable rather than the error \(\epsilon_i\):
\[Y_i \sim \text{Normal}(\alpha + \beta_1 x_i, \sigma^2).\]
That is, we assume the \(i^{th}\) observation’s response, \(Y_i\), comes from a normal distribution with mean \(\mu_i = \alpha + \beta_1 x_i\) and variance \(\sigma^2\).
In this case we assume that the observations are independent, that the mean of \(Y_i\) is a linear function of \(x_i\), that the responses are normally distributed, and that the variance \(\sigma^2\) is constant.
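We can check this "randomness in the response" formulation by simulation: draw each \(Y_i\) directly from a normal distribution with mean \(\alpha + \beta_1 x_i\), and confirm that lm() recovers the parameters. A minimal sketch (all parameter values are arbitrary choices):

```r
set.seed(42)
n <- 1000
x <- runif(n, 0, 10)
alpha <- 2; beta1 <- 0.5; sigma <- 1

# Simulate Y_i ~ Normal(alpha + beta1 * x_i, sigma^2) directly
y <- rnorm(n, mean = alpha + beta1 * x, sd = sigma)

# Fitting with lm() recovers alpha and beta1 (up to sampling error)
fit <- lm(y ~ x)
coef(fit)
```

The estimated intercept and slope should be close to 2 and 0.5 respectively.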
But, what if we want to be a little more flexible and move away from some of these assumptions? What if we want to rid ourselves of a model with normal errors? The answer: Generalised Linear Models (GLMs).
The Poisson distribution is commonly used to model counts. It is a discrete distribution (of non-negative integer values only) and has \(\text{Var}(Y_i) = \mu_i\) (i.e., we expect the variance to increase with the mean).
If we were to assume (as previously) that \(\mu_i = \alpha + \beta_1 x_i\) then we would be allowing \(\mu < 0\), which is not supported by the Poisson distribution. So, we use a link function to map between \(\mu_i\) and the real number line:
\[\text{log}(\mu_i) = \alpha + \beta_1 x_i.\]
So, \(\mu_i > 0\) and \(-\infty < \text{log}(\mu_i) < \infty\); however negative the linear predictor \(\alpha + \beta_1 x_i\) gets, \(\mu_i\) will always be positive.
Equivalently, \[ \mu_i = \text{exp}(\alpha + \beta_1 x_i)\]
and
\[Y_i \sim \text{Poisson}(\mu_i)\]
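A quick numerical check that the exponential inverse link keeps the mean positive regardless of how negative the linear predictor gets (the values of \(\eta\) below are arbitrary):

```r
# Linear predictor values, including very negative ones
eta <- c(-10, -1, 0, 1, 10)

# Inverse of the log link: mu = exp(eta)
mu <- exp(eta)
all(mu > 0)  # TRUE: the mean is always strictly positive
```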
Recall, for a linear regression model \(\mu = \alpha + \beta_1x\), when \(x=0\) \(y = \alpha\) and for every one-unit increase in \(x\), \(y\) increases by amount \(\beta_1\).
For a Poisson regression model we have \(\text{log}(\mu) = \alpha + \beta_1 x.\)
We can interpret this in the same way! That is, when \(x\) is zero, the log of the expected value of the response equals \(\alpha\) and for every one-unit increase in \(x\), the log of the expected value of the response increases by amount \(\beta_1\). But interpreting the effect of \(x\) on the log of the expected value is not straightforward.
Now, we have
\[\mu = \text{e}^{ \alpha + \beta_1 x} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{ x}.\] This implies that when \(x = 0\) the expected value of the response is \(\text{e}^{\alpha}\), and for every one-unit increase in \(x\) the expected value of the response is multiplied by \(\text{e}^{\beta_1}\).
Therefore, for every n-unit increase in x, the expected value of the response is multiplied by \(\text{e}^{n\beta_1}.\)
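The n-unit multiplicative claim follows from \(\text{e}^{n\beta_1} = (\text{e}^{\beta_1})^n\), which we can verify numerically (the slope value is an arbitrary choice):

```r
beta1 <- 0.3  # an arbitrary slope on the log scale
n <- 4

# A one-unit increase multiplies the mean by exp(beta1);
# an n-unit increase multiplies it by exp(n * beta1) = exp(beta1)^n
all.equal(exp(n * beta1), exp(beta1)^n)  # TRUE
```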
Typically, we use Poisson regression to model count data (e.g., the number of birds counted at a site).
But, how do we assess if our choice was appropriate? Use the deviance!
For a fitted Poisson regression the deviance, \(D\), is
\[D = 2 \sum^{n}_{i=1} \{ Y_{i} \log(Y_{i}/\mu_{i}) - (Y_{i}-\mu_{i}) \}\]
where if \(Y_i=0\), the \(Y_{i} \log(Y_{i}/\mu_{i})\) term is taken to be zero, and \(\mu_{i} = \exp(\hat{\beta}_{0} + \hat{\beta}_{1}X_{1} + ... + \hat{\beta}_{p} X_{p})\) is the predicted mean for observation \(i\) based on the estimated model parameters.
The deviance is a measure of how well the model fits the data. That is, if the model fits well, the observed values \(Y_{i}\) will be close to their predicted means \(\mu_{i}\), causing both of the terms in \(D\) to be small, and so the deviance to be small. The flip side of this is that a large deviance indicates a bad fitting model.
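The deviance formula above is exactly what glm() reports as the residual deviance. A sketch verifying this by hand on simulated Poisson data (the simulated model is an arbitrary choice):

```r
set.seed(1)
x <- runif(50)
y <- rpois(50, lambda = exp(1 + 2 * x))  # simulated counts
fit <- glm(y ~ x, family = "poisson")

mu <- fitted(fit)
# Deviance formula, with the convention that y*log(y/mu) = 0 when y = 0
term <- ifelse(y == 0, 0, y * log(y / mu))
D <- 2 * sum(term - (y - mu))

all.equal(D, deviance(fit))  # TRUE
```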
Formally, we can test the null hypothesis that the model is correct by calculating a p-value using \[ p = \Pr(\chi^2_{n - k} > D).\]
Conditions of the chi-squared approximation
The distribution of the deviance under the null hypothesis is approximately chi-squared if the response of each observation is well approximated by a normal distribution. This holds for Poisson random variables with \(\mu_i > 5\).
However, if our chi-squared approximation assumptions are not met we should find another way.
A recent publication Partitioning beta diversity to untangle mechanisms underlying the assembly of bird communities in Mediterranean olive groves investigates bird abundance data for a number of olive farms. Each farm is catalogued according to the level of landscape complexity (high; intermediate; low) and the type of management of the ground cover (extensive or intensive).
library(tidyverse)
data <- read_delim("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/bird_abundance.csv")
birds <- data %>%
dplyr::select(c("OliveFarm","Management","Turdus_merula","Phylloscopus_collybita")) %>%
pivot_longer(., c(-OliveFarm, -Management), names_to = "Species", values_to = "Count")
birds
## # A tibble: 80 × 4
## OliveFarm Management Species Count
## <chr> <chr> <chr> <dbl>
## 1 MORALEDAEXP intensive Turdus_merula 21
## 2 MORALEDAEXP intensive Phylloscopus_collybita 14
## 3 MORALEDACON extensive Turdus_merula 29
## 4 MORALEDACON extensive Phylloscopus_collybita 21
## 5 OJUELOSEXP intensive Turdus_merula 5
## 6 OJUELOSEXP intensive Phylloscopus_collybita 31
## 7 OJUELOSCON extensive Turdus_merula 12
## 8 OJUELOSCON extensive Phylloscopus_collybita 52
## 9 LUNAEXP extensive Turdus_merula 103
## 10 LUNAEXP extensive Phylloscopus_collybita 10
## # ℹ 70 more rows
To fit a Poisson model we specify family = "poisson" in a call to glm(). Note that the default link function for family = "poisson" is the log link; we could also use the equivalent syntax poisson(link = log) to specify this model. Or, we could change the link function to something else (e.g., poisson(link = identity)) that makes sense.
glm_bird <- glm(Count ~ Species, data = birds, family = "poisson")
summary(glm_bird)
##
## Call:
## glm(formula = Count ~ Species, family = "poisson", data = birds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.06105 0.03422 89.45 <2e-16 ***
## SpeciesTurdus_merula 0.94537 0.04032 23.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2234.6 on 79 degrees of freedom
## Residual deviance: 1621.9 on 78 degrees of freedom
## AIC: 2030.1
##
## Number of Fisher Scoring iterations: 5
The fitted model is therefore
\[ \log ({ \widehat{E( \operatorname{Count} )} }) = 3.061 + 0.945(\operatorname{Species}_{\operatorname{Turdus\_merula}}) \]
Interpreting the coefficients above we estimate that the log of the expected number of Phylloscopus collybita is 3.06 and the log of the expected number of Turdus merula is 4.01.
But what about the expected number?
exp(coef(glm_bird))
## (Intercept) SpeciesTurdus_merula
## 21.35000 2.57377
Therefore, we estimate that the expected number of Phylloscopus collybita is 21.35 and the expected number of Turdus merula is 54.95.
Using confidence intervals:
confint <- exp(confint(glm_bird)[1,])
confint
## 2.5 % 97.5 %
## 19.94989 22.81408
Therefore, we estimate that the expected number of Phylloscopus collybita is between 19.95 and 22.81.
For a multiplicative interpretation of the effect use \(\text{exp}(\beta_1)\)
exp(coef(glm_bird)[2])
## SpeciesTurdus_merula
## 2.57377
Therefore, we estimate that the expected number of Turdus merula is 2.57 \(\times\) greater than Phylloscopus collybita.
For a percentage-change interpretation use \(100 \times \left (\text{exp}(\beta_1)−1 \right )\):
100*(exp(coef(glm_bird)[2]) - 1)
## SpeciesTurdus_merula
## 157.377
Therefore, we estimate that expected number of Turdus merula is 157.38% greater than Phylloscopus collybita.
Using 95% confidence intervals:
confint <- 100*(exp(confint(glm_bird)[2, ]) - 1)
confint
## 2.5 % 97.5 %
## 137.9259 178.6742
So, we estimate that expected number of Turdus merula is between 137.93% and 178.67% greater than Phylloscopus collybita.
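Before moving on, we can apply the deviance goodness-of-fit test described earlier to this model, using the residual deviance reported by summary(glm_bird) (1621.9 on 78 degrees of freedom):

```r
# Residual deviance and df from the summary(glm_bird) output above
D <- 1621.9
df <- 78

# p = Pr(chi^2_{n-k} > D)
p_value <- pchisq(D, df = df, lower.tail = FALSE)
p_value  # effectively zero
```

The tiny p-value suggests the Poisson model fits these counts poorly (the deviance is far larger than its degrees of freedom, which is often a sign of overdispersion).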
We define a random variable, \(Y_i\), to have a binomial distribution if it is the number of successes from a number of independent trials, \(n\), each with the same probability of success, \(p\). It is a discrete distribution: the number of successes associated with the \(i^{th}\) observation must be an integer between \(0\) and \(n_i\). In addition, it builds in the non-constant variance of \(Y_i\) and \(\frac{Y_i}{n_i}\): \(\text{Var}(Y_i)=n_ip_i(1−p_i)\) and \(\text{Var}(\frac{Y_i}{n_i}) = \frac{p_i(1−p_i)}{n_i}\).
If we were to assume a linear relationship (as previously), \(p_i = \alpha + \beta_1 x_i\), then we would be allowing \(p_i < 0\) and \(p_i > 1\), which is not supported by the binomial distribution. So, we use a link function to map between \(p_i\) and the real number line:
\[\text{logit}(p_i) = \text{log}\left (\frac{p_i}{1 - p_i}\right ) = \alpha + \beta_1x_i.\] This leads to \[p_i = \frac{\exp(\alpha + \beta_1x_i)}{1 + \exp(\alpha + \beta_1x_i)}.\] and
\[Y_i \sim \text{Binomial}(n_i, p_i)\]
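R has the logit link and its inverse built in: qlogis() computes \(\text{log}(p/(1-p))\) and plogis() maps back from the real line to \((0, 1)\). A quick sketch (the probability value is arbitrary):

```r
p <- 0.3
eta <- qlogis(p)  # the logit: log(p / (1 - p))

# The inverse logit maps any real eta back into (0, 1)
all.equal(plogis(eta), p)  # TRUE: mapping back recovers p
```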
Recap: for probability \(p\) the odds are \(\frac{p}{1-p}\) and the log-odds are \(\text{log}\left (\frac{p}{1-p}\right )\).
Recall, for a linear regression model \(\mu = \alpha + \beta_1x\), when \(x=0\) \(y = \alpha\) and for every one-unit increase in \(x\), \(y\) increases by amount \(\beta_1\).
For a logistic regression model we have \[\begin{array}{rl} \text{logit}(p) = \alpha + \beta_1 x \\ \text{log}\left (\frac{p}{1-p}\right ) = \alpha + \beta_1 x \\ \text{log}\left (\text{odds}\right ) = \alpha + \beta_1 x \\ \end{array}\]
We can interpret this as when \(x = 0\), the log-odds of success are equal to \(\alpha\) and that for every one-unit increase in \(x\) the log-odds of success increase by \(\beta_1\). But, interpreting the effect of \(x\) on the log-odds of success is not straightforward.
Now, we have
\[\text{odds} = \text{e}^{ \alpha + \beta_1 x} = \text{e}^{ \alpha}(\text{e}^{\beta_1})^{ x}.\] This implies that when \(x = 0\) the odds of success are \(\text{e}^{\alpha}\), and for every one-unit increase in \(x\) the odds of success are multiplied by \(\text{e}^{\beta_1}\).
Typically, we use logistic regression to model a binary or proportional response (e.g., success vs. failure). But, how do we assess if our choice was appropriate? Use the deviance!
Formally, we can test the null hypothesis that the model is correct by calculating a p-value using \[ p = \Pr(\chi^2_{n - k} > D).\]
Conditions of the chi-squared approximation
The distribution of the deviance under the null hypothesis is approximately chi-squared if the response of each observation is well approximated by a normal distribution. This holds for binomial random variables if the number of trials, \(n_i\), is large enough; a common rule of thumb is that both \(n_ip_i\) and \(n_i(1-p_i)\) are greater than 5.
However, if our chi-squared approximation assumptions are not met we should find another way.
Let us, again, consider data from the published article Influence of predator identity on the strength of predator avoidance responses in lobsters.
The authors were interested in how a juvenile lobster’s size was related to its vulnerability to predation. In total, 159 juvenile lobsters were collected from their natural habitat in the Gulf of Maine, USA, and the length of each lobster’s carapace (upper shell) was measured to the nearest 3 mm, size. The lobsters were then tethered to the ocean floor for 24 hours. Any missing lobsters were assumed to have been consumed by a predator, while the surviving lobsters were released (i.e., survived = 1 if lobster survived, 0 otherwise).
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/lobster.csv")
data
## # A tibble: 159 × 2
## size survived
## <dbl> <dbl>
## 1 42 0
## 2 36 0
## 3 51 1
## 4 33 0
## 5 33 1
## 6 45 1
## 7 54 1
## 8 48 0
## 9 39 0
## 10 48 1
## # ℹ 149 more rows
To fit a binomial model we specify family = "binomial" in our glm call. Note that the default link function for family = "binomial" is the logit link; we could also use the equivalent syntax binomial(link = logit) to specify this model.
glm_mod_ug <- glm(survived ~ size, family = "binomial", data = data)
summary(glm_mod_ug)
##
## Call:
## glm(formula = survived ~ size, family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.89597 1.38501 -5.701 0.00000001191 ***
## size 0.19586 0.03415 5.735 0.00000000977 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 220.41 on 158 degrees of freedom
## Residual deviance: 172.87 on 157 degrees of freedom
## AIC: 176.87
##
## Number of Fisher Scoring iterations: 4
The fitted model is therefore
\[ \log\left[ \frac { \widehat{P( \operatorname{survived} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{survived} = \operatorname{1} )} } \right] = -7.896 + 0.196(\operatorname{size}) \]
The data are currently ungrouped, despite many lobsters sharing the same carapace size. Therefore, we rearrange the data set so that it is grouped:
grouped <- data %>%
group_by(size) %>%
summarise(y = sum(survived), n = length(survived), p = mean(survived))
grouped
## # A tibble: 11 × 4
## size y n p
## <dbl> <dbl> <int> <dbl>
## 1 27 0 5 0
## 2 30 1 10 0.1
## 3 33 3 22 0.136
## 4 36 7 21 0.333
## 5 39 12 22 0.545
## 6 42 17 29 0.586
## 7 45 13 18 0.722
## 8 48 12 17 0.706
## 9 51 7 8 0.875
## 10 54 6 6 1
## 11 57 1 1 1
Where size is as above, y is the number of lobsters of each size that survived, n is the total number of lobsters of each size, and p is the proportion of lobsters of each size that survived.
To fit a binomial model to the grouped data we again specify family = "binomial" in our glm call, and specify our response as cbind(y, n - y):
glm_mod_gr <- glm(cbind(y, n - y) ~ size, family = "binomial", data = grouped)
summary(glm_mod_gr)
##
## Call:
## glm(formula = cbind(y, n - y) ~ size, family = "binomial", data = grouped)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.89597 1.38501 -5.701 0.00000001191 ***
## size 0.19586 0.03415 5.735 0.00000000977 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 52.1054 on 10 degrees of freedom
## Residual deviance: 4.5623 on 9 degrees of freedom
## AIC: 32.24
##
## Number of Fisher Scoring iterations: 4
The fitted model is again
\[ \log\left[ \frac { \widehat{P( \operatorname{survived} = \operatorname{1} )} }{ 1 - \widehat{P( \operatorname{survived} = \operatorname{1} )} } \right] = -7.896 + 0.196(\operatorname{size}) \]
Interpreting the coefficients above, we estimate that the log-odds of a juvenile lobster with carapace length zero surviving are -7.90 (use your common sense to ascertain if interpreting the intercept is sensible).
We estimate that for every 1mm increase in carapace length the log-odds of a juvenile lobster surviving increase by 0.196.
What about the odds?
exp(coef(glm_mod_gr))
## (Intercept) size
## 0.0003722407 1.2163540760
Therefore, we estimate that the odds of a juvenile lobster surviving are 0.000372 (use your common sense to ascertain if interpreting the intercept is sensible).
We estimate that for every 1mm increase in carapace length the odds of a juvenile lobster surviving are multiplied by 1.22.
What about for a 5mm increase in carapace length?
exp(5*coef(glm_mod_gr)[2])
## size
## 2.662564
Therefore, we estimate that for every 5mm increase in carapace length the odds of a juvenile lobster surviving are multiplied by 2.66.
Using 95% confidence intervals:
ci <- exp(5*confint(glm_mod_gr)[2,])
ci
## 2.5 % 97.5 %
## 1.944478 3.811167
Therefore, we estimate that for every 5mm increase in carapace length the odds of a juvenile lobster surviving are multiplied by between 1.94 and 3.81.
For a percentage-change interpretation we use \(100 \times \left(\exp(x\beta_1) - 1\right)\):
100*(exp(5*coef(glm_mod_gr)[2]) - 1)
## size
## 166.2564
## (named ci_pct rather than confint, to avoid masking stats::confint())
ci_pct <- 100*(exp(5*confint(glm_mod_gr)[2, ]) - 1)
ci_pct
## 2.5 % 97.5 %
## 94.44778 281.11670
We estimate that for every 5mm increase in carapace length the odds of a juvenile lobster surviving increase by 166.26% (95% CI: 94.45% to 281.12%).
Recall from Module 2 where we looked at residual plots to check our assumptions for a linear model. In a similar way we can use residuals to check how appropriate our GLM is (i.e., to help diagnose the overall goodness-of-fit and to see if our model assumptions are met). Often, Pearson and deviance residuals are used in model diagnostics for generalized linear models.
Response residuals are the conventional residuals on the response scale: the observed value minus the fitted value, where the fitted values are obtained by applying the inverse link function to the linear predictor. Think back to linear models, where the link function was the identity.
Deviance residuals represent the contributions of individual samples to the deviance, D (see above).
Pearson residuals are calculated by normalizing the raw residuals (i.e., observed minus fitted) by the square root of the estimated variance of the fitted value.
Luckily, we can calculate all three directly in R. Below we do that for the bird abundance model.
## response residuals
resids_response <- residuals(glm_bird, type = "response")
## deviance residuals
resids_deviance <- residuals(glm_bird, type = "deviance")
## pearson residuals
resids_pearson <- residuals(glm_bird, type = "pearson")
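To see how these definitions map onto residuals(), here is a minimal sketch on simulated Poisson data (illustrative, not the bird abundance data) computing the response and Pearson residuals by hand:

```r
## simulated Poisson data for illustration (not the bird abundance data)
set.seed(1)
x <- runif(100)
y <- rpois(100, exp(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = poisson)
mu <- fitted(fit)
## response residuals: observed minus fitted, on the response scale
all.equal(unname(residuals(fit, type = "response")), unname(y - mu))
## Pearson residuals: raw residual divided by the square root of the
## fitted variance (which equals mu for the Poisson)
all.equal(unname(residuals(fit, type = "pearson")), unname((y - mu) / sqrt(mu)))
```

Both comparisons return TRUE: the built-in residuals() is doing exactly the calculation described above.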
Plotting all three types we can assess how appropriate our chosen model is.
So, what do you see? Look back at the assumptions of a Poisson model. What do you conclude from the plots above?
Recall from Module 2 that a Normal quantile-quantile (Q-Q) plot is used to check the overall similarity of the observed distribution of the residuals to that expected under the model (i.e., Gaussian). An alternative to a Normal Q-Q plot for a GLM fit is a quantile residual Q-Q plot of observed versus expected quantile residuals. Quantile residuals are typically the residuals of choice for GLMs, as the deviance and Pearson residuals can be grossly non-normal. It is suggested that quantile residuals are the only useful residuals for binomial or Poisson data when the response takes on only a small number of distinct values. We can use the statmod::qresiduals() function to compute these residuals.
## extract the residual deviance
D <- glm_bird$deviance
D
## [1] 1621.948
## extract the residual degrees of freedom (n-k)
df <- glm_bird$df.residual
df
## [1] 78
Therefore, to test the relevant null hypothesis (that the model is correct) we use
1 - pchisq(D, df)
## [1] 0
We have strong evidence against the null hypothesis, suggesting a lack of fit! BUT are the assumptions of our chi-squared approximation met? If not, we might take a simulation-based approach. This is, however, beyond the scope of this course.
## extract the residual deviance
D <- glm_mod_gr$deviance
D
## [1] 4.562321
## extract the residual degrees of freedom (n-k)
df <- glm_mod_gr$df.residual
df
## [1] 9
Therefore, to test the relevant null hypothesis (that the model is correct) we use
1 - pchisq(D, df)
## [1] 0.8706732
Here, we have no evidence against our model being “correct”. BUT are the assumptions of our chi-squared approximation met? If not, we might take a simulation-based approach. This is, however, beyond the scope of this course.
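As a sanity check, data simulated from a correctly specified Poisson model should typically give a large p-value from this deviance test. A minimal sketch (illustrative values, not the course data):

```r
## simulate from a correctly specified Poisson model
set.seed(2)
x <- runif(500)
y <- rpois(500, exp(2 + x))
fit <- glm(y ~ x, family = poisson)
## deviance goodness-of-fit test; a large p-value gives
## no evidence of lack of fit
1 - pchisq(fit$deviance, fit$df.residual)
```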
The three distributions we’ve covered are given below.
Linear regression: \(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\) where \(\mu_i = \alpha + \beta_1 x_i\)
Poisson regression: \(Y_i \sim \text{Poisson}(\mu_i)\) where \(\text{log}(\mu_i) = \alpha + \beta_1 x_i\)
Logistic regression: \(Y_i \sim \text{Binomial}(n_i, p_i)\) where \(\text{logit}(p_i) = \alpha + \beta_1 x_i\)
What would happen if we wanted to add extra explanatory terms (e.g., \(z_i\))? Then,
Linear regression: \(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\) where \(\mu_i = \alpha + \beta_1 x_i + \beta_2 z_i\)
Poisson regression: \(Y_i \sim \text{Poisson}(\mu_i)\) where \(\text{log}(\mu_i) = \alpha + \beta_1 x_i+ \beta_2 z_i\)
Logistic regression: \(Y_i \sim \text{Binomial}(n_i, p_i)\) where \(\text{logit}(p_i) = \alpha + \beta_1 x_i+ \beta_2 z_i\)
What about interactions (e.g., \(x_iz_i\))?
Linear regression: \(Y_i \sim \text{Normal}(\mu_i, \sigma^2)\) where \(\mu_i = \alpha + \beta_1 x_i + \beta_2 z_i + \beta_3 x_iz_i\)
Poisson regression: \(Y_i \sim \text{Poisson}(\mu_i)\) where \(\text{log}(\mu_i) = \alpha + \beta_1 x_i+ \beta_2 z_i + \beta_3 x_iz_i\)
Logistic regression: \(Y_i \sim \text{Binomial}(n_i, p_i)\) where \(\text{logit}(p_i) = \alpha + \beta_1 x_i+ \beta_2 z_i + \beta_3 x_iz_i\)
Assume the observations are independent of one another, then,
Choose a distribution for the response. For example, Normal, Poisson, or Binomial.
Choose a parameter to relate to explanatory terms. For example, \(\mu_i\), \(\mu_i\), or \(p_i\).
Choose a link function. For example, identity, log, or logit.
Choose explanatory terms
Estimate additional parameters. For example, \(\sigma^2\).
We are not restricted to the three distributions above. Many others exist:
| Distribution | Notation | Mean \(\mu = E(Y)\) | Variance \(\sigma^2 = \text{Var}(Y)\) | Linear predictor (w. a typical link function) |
|---|---|---|---|---|
| Gaussian | \(Y\sim\textbf{Normal}(\mu, \sigma^2)\) | \(\mu\) | \(\sigma^2\) | I\((\mu) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
| Poisson | \(Y \sim \textbf{Poisson}(\mu)\) where \(\mu = \text{rate}\) | \(\mu\) | \(\mu\) | log\((\mu) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
| Binomial | \(Y \sim \textbf{Binomial}(\text{n}, p)\) where \(\text{n} = \text{number of trials}\) and \(p = \text{probability of success}\) | \(\text{n}p\) | \(\text{n}p (1-p)\) | logit\((p) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
| Gamma | \(Y \sim \textbf{Gamma}(k, \theta = \frac{1}{\text{rate}})\) where \(k = \text{shape}\) and \(\theta = \text{scale}\) | \(k\theta\) | \(k\theta^2\) | log\((E(Y)) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
| Beta | \(Y \sim \textbf{Beta}(a, b)\) where \(a = \text{shape}\) and \(b = \text{scale}\) | \(\frac{a}{a+b}\) | \(\frac{ab}{(a+b)^2(a+b+1)}\) | log\((E(Y)) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
| Negative binomial | \(Y \sim \textbf{NB}(r, p)\) where \(\text{r} = \text{number of successes until the experiment is stopped}\) and \(p = \text{probability of success}\) or \(Y \sim \textbf{NB}(k, p)\) where \(\text{k} = \text{number of failures}\) given \(p = \text{probability of success}\) | \(\frac{r(1-p)}{p}\) or \(\mu = k\frac{p}{1-p}\) | \(\frac{r(1-p)}{p^2}\) or \(\mu + \frac{\mu^2}{k}\) | log\((E(Y)) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
| Beta-binomial | \(Y \sim \textbf{BetaBin}(\text{n},a, b)\) where \(\text{n} = \text{number of trials}\) and \(p = \frac{a}{a + b}, \text{the probability of success}\) | \(\frac{\text{n} a}{a+b}= \text{n}p\) | \(\frac{\text{n} a b(a + b + \text{n})}{(a+b)^2(a + b + 1)}\) | logit\((p) = \alpha + \Sigma_{j=1}^{n_\text{covariates}}\beta_j x_j\) |
\(^{+}\) Note that the Quasi-… distributions are not really proper distributions; they are simply models describing a mean-variance relationship beyond that offered by the base distribution.
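Where base R provides the family (e.g., Gamma), these can be fitted with the same glm() machinery. A minimal sketch of a Gamma GLM with a log link on simulated data (the parameter values here are made up for illustration):

```r
## a Gamma GLM with a log link, on simulated data
set.seed(42)
x <- runif(200)
mu <- exp(0.1 + 0.8 * x)                      ## mean, log-linear in x
y <- rgamma(200, shape = 2, scale = mu / 2)   ## so E(Y) = shape * scale = mu
gamma_fit <- glm(y ~ x, family = Gamma(link = "log"))
coef(gamma_fit)  ## estimates should be roughly 0.1 and 0.8
```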
| R function | Use |
|---|---|
| gam() | Fit a generalised additive model. The R package mgcv must be loaded |
| nlme() | Fit non-linear mixed effects models (use lme() for linear mixed effects). The R package nlme must be loaded |
| gls() | Fit generalised least squares models. The R package nlme must be loaded |
Recall module 3 when we covered fitting linear models with random effects (lmer). The fixed effects and random effects were specified via the model formula. Now we’ve covered GLMs we can include random effects here too and fit GLMMs!
Recall
The authors of this paper transplanted gut microbiota from human donors with Autism Spectrum Disorder (ASD) or typically developing (TD) controls into germ-free mice. Faecal samples were collected from three TD and five ASD donors and were used to colonise GF male and female mice from strain C57BL/6LJ. Individuals colonized by the same donor were allowed to breed. Adult offspring mice were behavior tested; tests included marble burying (MB), open-field testing (OFT), and ultrasonic vocalization (USV).
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/autism.csv")
data
## # A tibble: 206 × 3
## Donor Treatment MB_buried
## <chr> <chr> <dbl>
## 1 C1 NT Female 3
## 2 C1 NT Female 8
## 3 C1 NT Female 1
## 4 C1 NT Female 3
## 5 C1 NT Male 2
## 6 C1 NT Male 6
## 7 C1 NT Male 4
## 8 C1 NT Male 2
## 9 C1 NT Female 9
## 10 A24-new ASD Male 7
## # ℹ 196 more rows
We should separate out the treatment values:
mice <- data %>%
separate(., col = Treatment, into = c("Diagnosis", "Sex"))
mice
## # A tibble: 206 × 4
## Donor Diagnosis Sex MB_buried
## <chr> <chr> <chr> <dbl>
## 1 C1 NT Female 3
## 2 C1 NT Female 8
## 3 C1 NT Female 1
## 4 C1 NT Female 3
## 5 C1 NT Male 2
## 6 C1 NT Male 6
## 7 C1 NT Male 4
## 8 C1 NT Male 2
## 9 C1 NT Female 9
## 10 A24-new ASD Male 7
## # ℹ 196 more rows
Recall, using lmer to fit a LMM
lmer <- lmerTest::lmer(MB_buried ~ Sex * Diagnosis + (1|Donor), data = mice)
car::Anova(lmer, type = 2)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: MB_buried
## Chisq Df Pr(>Chisq)
## Sex 3.6329 1 0.056648 .
## Diagnosis 8.6944 1 0.003192 **
## Sex:Diagnosis 2.9745 1 0.084588 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
But is the constant error variance appropriate? Below we plot the model residuals.
glmer <- lme4::glmer(MB_buried ~ Sex * Diagnosis + (1|Donor), data = mice, family = poisson(link = "log"))
car::Anova(glmer, type = 2)
## Analysis of Deviance Table (Type II Wald chisquare tests)
##
## Response: MB_buried
## Chisq Df Pr(>Chisq)
## Sex 9.9452 1 0.0016127 **
## Diagnosis 10.5836 1 0.0011410 **
## Sex:Diagnosis 12.5904 1 0.0003877 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Learning objectives
R code to carry out hierarchical and k-means cluster analysis
R output from hierarchical and k-means cluster analysis
R code to carry out PCA
Other resources
“Clusters may be described as continuous regions of (a) space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points.”
“Cluster analysis has the apparently simple aim of finding clusters in a data cloud of sampling units in the absence of any a priori information about which point belongs in which cluster. This apparently unambitious aim is unfortunately fraught with problems.”
In brief, cluster analysis involves using measures of (dis)similarity and distances to help us define clusters. We use this to uncover hidden or latent clustering by partitioning the data into tighter sets. There are two main methods for doing this: 1) divisive methods use nonparametric algorithms (such as k-means) to split data into a small number of clusters, and 2) agglomerative methods cluster cases and/or variables into a hierarchy of sets (e.g., hierarchical clustering). We can use resampling-based bootstrap methods to validate clusters.
For a single run of a partitioning method, the number of clusters ( \(k\) ) is typically fixed beforehand. Typically, there are only two steps to a partitioning method:
The initial allocation is usually started by choosing \(k\) sampling units to use as seeds to set the clusters. There are a number of ways to choose these seeds. These seeds are used as the initial centres of the clusters, points are allocated to the nearest cluster centre, and in most programs the cluster centroid is adjusted as they are added.
K-means clustering involves defining clusters so that the overall variation within a cluster (known as total within-cluster variation) is minimized. How do we define this variation? Typically, using Euclidean distances; the total within-cluster variation is, in this case, defined as the sum of squared Euclidean distances between observations and the corresponding cluster centroid.
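This definition can be verified by hand against what kmeans() reports. A minimal sketch on simulated data:

```r
## total within-cluster SS computed by hand, on simulated data
set.seed(1)
mat <- matrix(rnorm(100 * 2), ncol = 2)
km <- kmeans(mat, centers = 3, nstart = 25)
## sum of squared Euclidean distances from each point to its cluster centroid
wss <- sum((mat - km$centers[km$cluster, ])^2)
all.equal(wss, km$tot.withinss)
```

The comparison returns TRUE: tot.withinss is exactly the quantity described above.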
In summary, the k-means procedure is
Identifying the appropriate \(k\) is important because too many or too few clusters impedes viewing overall trends. Too many clusters can lead to over-fitting (which limits generalizations) while insufficient clusters limits insights into commonality of groups.
There are assorted methodologies to identify the appropriate \(k\). Tests range from blunt visual inspections to robust algorithms. The optimal number of clusters is ultimately a subjective decision.
palmerpenguins data
library(palmerpenguins)
## getting rid of NAs
penguins_nafree <- penguins %>% drop_na()
## introducing a new package GGally, please install
## using install.packages("GGally")
library(GGally)
penguins_nafree %>%
dplyr::select(species, where(is.numeric)) %>%
ggpairs(columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm"))
We see that a lot of these variables (e.g., flipper_length_mm, body_mass_g, and bill_length_mm) are relatively strongly (positively) related to one another. Could they actually be telling us the same information? Combined, we could think of these three variables as all telling us a little about the bigness of a penguin. Is there a way we could reduce these three variables into, say, one that represents the bigness of a penguin? We may not need all the information (variation) captured by these variables; we could get away with fewer new uncorrelated variables that represent basically the same information (e.g., penguin bigness), thereby reducing the dimensionality of the data (more on this later).
## create a data frame of what we're interested in
df <- penguins_nafree %>%
dplyr::select(where(is.numeric), -year)
We use the kmeans() function.
The first argument of kmeans() should be the dataset you wish to cluster. Below we use data frame df, the penguin data discussed above. But how many clusters do we choose? Let’s try 1 to 5… (i.e., using the centers argument). Setting nstart = 25 means that R will try 25 different random starting assignments and then select the best results corresponding to the one with the lowest within cluster variation.
## set the seed so we all start off in the same place
set.seed(4321)
## one cluster
k1 <- kmeans(df, centers = 1, nstart = 25)
## two clusters
k2 <- kmeans(df, centers = 2, nstart = 25)
## three clusters
k3 <- kmeans(df, centers = 3, nstart = 25)
## four clusters
k4 <- kmeans(df, centers = 4, nstart = 25)
## five clusters
k5 <- kmeans(df, centers = 5, nstart = 25)
The kmeans() function returns a list of components:
cluster, integers indicating the cluster to which each observation is allocated
centers, a matrix of cluster centers/means
totss, the total sum of squares
withinss, within-cluster sum of squares, one component per cluster
tot.withinss, total within-cluster sum of squares
betweenss, between-cluster sum of squares
size, number of observations in each cluster
We have an idea there may be 3 clusters, perhaps, but how do we know this is the best fit? Remember it's a subjective choice and we'll be looking at a few pointers.
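Before that, the components listed above can be inspected directly. A minimal sketch on simulated data, showing the sums-of-squares decomposition (totss = tot.withinss + betweenss):

```r
## inspect the components returned by kmeans(), on simulated data
set.seed(1)
mat <- matrix(rnorm(60 * 2), ncol = 2)
km <- kmeans(mat, centers = 3, nstart = 25)
km$size     ## observations per cluster
km$centers  ## cluster centroids
## the sums of squares decompose: totss = tot.withinss + betweenss
all.equal(km$totss, km$tot.withinss + km$betweenss)
```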
Visual inspection method
library(factoextra) ## a new package for kmeans viz, please install
p1 <- fviz_cluster(k1, data = df)
p2 <- fviz_cluster(k2, data = df)
p3 <- fviz_cluster(k3, data = df)
p4 <- fviz_cluster(k4, data = df)
p5 <- fviz_cluster(k5, data = df)
## for arranging plots
library(patchwork)
(p1| p2| p3)/ (p4 | p5)
Alternatively, you can use standard pairwise scatter plots to illustrate the clusters compared to the original variables.
df %>%
mutate(cluster = k3$cluster,
species = penguins_nafree$species) %>%
ggplot(aes(flipper_length_mm, bill_depth_mm, color = factor(cluster), label = species)) +
geom_text()
Elbow method
The optimal number of clusters is at the point at which the knee “bends”: in mathematical terms, where the marginal reduction in total within-cluster sum of squares (tot.withinss) from an additional cluster begins to decrease at a linear rate.
This is easier to see via a plot:
fviz_nbclust(df, kmeans, method = "wss") +
labs(subtitle = "Elbow method")
There is a pretty obvious inflection (elbow) at 2 clusters, but maybe at 3 too. We can rule out an optimal number of clusters above 3 as there is then only a minimal marginal reduction in total within sum of squares. However, the model is ambiguous on whether 2 or 3 clusters is optimal…
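For intuition, the quantity behind the elbow plot can be computed by hand: fit kmeans() for a range of k and extract tot.withinss each time. A minimal base-R sketch on simulated data (the pattern, not the penguin values, is the point):

```r
## the elbow quantity by hand: tot.withinss for k = 1, ..., 8
set.seed(1)
mat <- matrix(rnorm(100 * 2), ncol = 2)
wss <- sapply(1:8, function(k) kmeans(mat, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster SS")
```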
Silhouette method
# Silhouette method
fviz_nbclust(df, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette method")
Gap method
# Gap statistic
# recommended value: nboot = 500 for your analysis (it will take a while)
set.seed(123) ## remove this
fviz_nbclust(df, kmeans, nstart = 25, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
Basically it’s up to you to collate all the suggestions and make an informed decision
## Trying all the cluster indices AHHHHH
library(NbClust)
cluster_30_indexes <- NbClust(data = df, distance = "euclidean", min.nc = 2, max.nc = 9, method = "complete", index ="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 5 proposed 2 as the best number of clusters
## * 6 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 4 proposed 5 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 3 proposed 9 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
fviz_nbclust(cluster_30_indexes) +
theme_minimal() +
labs(title = "Frequency of Optimal Clusters using 30 indexes in NbClust Package")
## Error in if (class(best_nc) == "numeric") print(best_nc) else if (class(best_nc) == : the condition has length > 1
Not obvious; the indices are still split between 2 and 3, but according to the majority rule the “best” number is 3
Most of the hierarchical methods are agglomerative and they operate in the same way:
So, the close clusters are fused first, then those further apart, until all have been fused. This process allows the history of the fusions, the hierarchy, to be displayed as a dendrogram. This is an advantage of the agglomerative methods: if the data have a nested structure, these techniques lead to a useful way of displaying it.
Unlike the k-means most of the agglomerative techniques can use a broad range of similarity or distance measures. This, however, means that considerable care must be taken to choose the appropriate one; different measures often lead to different results.
Single Linkage (nearest neighbour/minimal jump): Computes the distance between clusters as the smallest distance between any two points in the two clusters.
Single Linkage identifies clusters based on how far apart they are at their closest points. This means that if there are any intermediate points then single linkage will fuse the groups without leaving any trace of their separate identities. This is called chaining and it often leads to uninformative dendrograms.
If, however, the clusters are well separated in the data, then single linkage can handle groups of different shapes and sizes easily. In addition, single linkage will give the same clustering after any monotonic transformation of the distance measure (i.e., it is fairly robust to the choice of measure).
Instead of measuring the distance between two clusters as that between their two nearest members, complete linkage (maximum jump) uses that between the two farthest members (i.e., it calculates the maximum distance between two points, one from each cluster).
The resulting clusters are often compact, spherical and well defined. Complete linkage can, however, be sensitive to tied distances. Although, it too, is robust to a certain amount of measurement error and choice of distance.
Group average linkage (UPGMA) is probably the most popular hierarchical clustering method! You might like to think of it as an attempt to avoid the extremes of the single and complete linkage methods as the distance between two clusters is the average of the distances between the members of the two groups. As a result this method tends to produce compact spherical clusters.
Ward’s method is the hierarchical version of the k-means partitioning method. At each fusion it attempts to minimise the increase in total sum of squared distances within the clusters. This is equivalent to minimising the sum of squared within cluster deviations from the centroids
Ward’s method, at any one stage, can only fuse those clusters already in existence (i.e., it is not allowed to reallocate points). A bad start to the agglomeration process can place the algorithm on a path from which it can never reach the global optimum for a given number of clusters. Its chief flaw is a tendency to form clusters of equal size, regardless of the true number. Like the complete linkage and group average methods it is also biased towards forming spherical clusters. Despite this, Ward’s method performs well in simulations and is often the method of choice.
In summary
| Method | Pros | Cons |
|---|---|---|
| Single linkage | number of clusters | comb-like trees. |
| Complete linkage | compact clusters | one obs. can alter groups |
| Average linkage | similar size and variance | not robust |
| Centroid | robust to outliers | smaller number of clusters |
| Ward | minimising an inertia | clusters small if high variability |
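The methods summarized above can be compared side by side on the same distance matrix via hclust(). A minimal base-R sketch using the built-in USArrests data (a stand-in for the ant data below):

```r
## the main linkage methods on the same distance matrix (built-in USArrests)
d <- dist(scale(USArrests))
par(mfrow = c(2, 2))
plot(hclust(d, method = "single"),   main = "Single linkage",   cex = 0.5)
plot(hclust(d, method = "complete"), main = "Complete linkage", cex = 0.5)
plot(hclust(d, method = "average"),  main = "Average linkage (UPGMA)", cex = 0.5)
plot(hclust(d, method = "ward.D2"),  main = "Ward",             cex = 0.5)
```

Note how, on the same data, single linkage tends towards chaining while complete linkage and Ward give more compact, even groups.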
Data were collected on the distribution of ant species at 30 sites across the Auckland region using pitfall traps. Twenty pitfall traps at each site were left open for ten days and the number of individuals captured counted for the four most abundant species: Nylanderia spp, Pheidole rugosula, Tetramorium grassii, and Pachycondyla sp.
library(tidyverse)
ants <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/pitfalls.csv")
ants
## # A tibble: 30 × 8
## Location Habitat Month Site Nyl Phe Tet Pac
## <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 West Forest 1 WF1 0 0 0 157
## 2 West Grass 1 WG1 0 2 7 37
## 3 West Urban 1 WU1 3 7 0 0
## 4 West Forest 2 WF2 0 0 0 31
## 5 West Grass 2 WG2 5 0 25 0
## 6 West Forest 3 WF3 0 0 0 21
## 7 West Grass 3 WG3 0 3 2 1
## 8 West Urban 3 WU3 0 1 0 0
## 9 Central Forest 1 CF1 0 0 0 1
## 10 Central Grass 1 CG1 0 3 22 2
## # ℹ 20 more rows
Data are species counts, so we will use the Bray-Curtis measure:
pitfall.dist <- vegan::vegdist(ants[,5:8], method = "bray", binary = FALSE)
factoextra::fviz_dist(pitfall.dist)
single <- ants[,5:8] %>%
vegan::vegdist(., method = "bray", binary = FALSE) %>%
hclust(method = "single")
plot(single, labels = ants$Site)
complete <- ants[,5:8] %>%
vegan::vegdist(., method = "bray", binary = FALSE) %>%
hclust(method = "complete")
plot(complete, labels = ants$Site)
average <- ants[,5:8] %>%
vegan::vegdist(., method = "bray", binary = FALSE) %>%
hclust(method = "average")
plot(average, labels = ants$Site)
ward <- ants[,5:8] %>%
vegan::vegdist(., method = "bray", binary = FALSE) %>%
hclust(method = "ward.D")
plot(ward, labels = ants$Site)
Using the function cutree() to split into clusters and plot:
ants$clust4 <- cutree(ward, k = 4)
library(ape) ## install
pitfall.phylo <- as.phylo(ward)
pitfall.phylo$tip.label <- ants$Site
## Set colours
colours <- c("red","blue","green","black")
plot(pitfall.phylo, cex = 0.6, tip.color = colours[ants$clust4], label.offset = 0.05)
Reduction of dimensions is needed when there are far too many features in a dataset, making it hard to distinguish between the important ones that are relevant to the output and the redundant or not-so important ones. Reducing the dimensions of data is called dimensionality reduction.
So the aim is to find the best low-dimensional representation of the variation in a multivariate (lots and lots of variables) data set, but how do we do this?
PCA is a member of a family of techniques for dimension reduction (ordination).
The word ordination was applied to dimension reduction techniques by botanical ecologists whose aim was to identify gradients in species composition in the field. For this reason they wanted to reduce the quadrat × species (observations × variables) data matrix to a single ordering (hence ordination) of the quadrats which they hoped would reflect the underlying ecological gradient.
One way is termed Principal Component Analysis (PCA). PCA is a feature extraction method that reduces the dimensionality of the data (number of variables) by creating new uncorrelated variables while minimizing loss of information on the original variables. It is a technique for the analysis of an unstructured sample of multivariate data. Its aim is to display the relative positions of the observations in the data cloud in fewer dimensions (while losing as little information as possible) and to help give insight into the way the observations vary. It is not a hypothesis testing technique (like t-test or Analysis of Variance); it is an exploratory, hypothesis generating tool that describes patterns of variation, and suggests relationships that should be investigated further.
Think of a baguette. The baguette pictured here represents two data dimensions: 1) the length of the bread and 2) the height of the bread (we’ll ignore depth of bread for now). Think of the baguette as your data; when we carry out PCA we’re rotating our original axes (x- and y-coordinates) to capture as much of the variation in our data as possible. This results in new uncorrelated variables that each explain a % of variation in our data; the procedure is designed so that the first new variable (PC1) explains the most, the second (PC2) the second most and so on.
Now rather than a baguette think of data; the baguette above represents the shape of the scatter between the two variables plotted below. The rotating grey axes represent the PCA procedure, essentially searching for the best rotation of the original axes to represent the variation in the data as best it can. Mathematically, the Euclidean distance (e.g., the distance between points \(p\) and \(q\) in Euclidean space, \(\sqrt{\sum_i (p_i-q_i)^2}\)) between the points and the rotating axes is being minimized (i.e., the shortest possible across all points), see the blue lines. Once this distance is minimized across all points we “settle” on our new axes (the black tiled axes).
Luckily we can do this all in R!
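As a first peek, the rotation described above is literally a matrix multiplication: the PC scores are the centred data multiplied by the loading (rotation) matrix. A minimal sketch on simulated data:

```r
## PCA as a rotation: scores = centred data %*% loading matrix
set.seed(1)
x <- rnorm(50)
xy <- cbind(x = x, y = x + rnorm(50, sd = 0.3))  ## two correlated variables
pca <- prcomp(xy)                                ## centres the data by default
scores_by_hand <- scale(xy, center = TRUE, scale = FALSE) %*% pca$rotation
all.equal(scores_by_hand, pca$x, check.attributes = FALSE)
```

The comparison returns TRUE: prcomp()'s scores are exactly this rotation of the centred data.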
One problem with PCA is that the components are not scale invariant. That means if we change the units in which our variables are expressed, we change the components; and not in any simple way either. So, every scaling or adjustment of the variables in preparation for the analysis could (and usually does) produce a separate component structure. It is therefore important to choose a standardisation or transformation carefully: PCA will give different results depending on whether we analyse the covariance matrix, where the data have merely been centred (corrected for the column, variable, mean), or the correlation matrix, where the data have been standardised to z-scores (centred and converted into standard deviation units). This is particularly important, as many computer programs to do PCA automatically analyse the correlation matrix. If you do not want that standardisation, you may have to explicitly ask for the covariance matrix. As you would expect, the results from the two analyses will usually be very different, see below.
palmerpenguins data
## getting rid of NAs
penguins_nafree <- penguins %>% drop_na()
When carrying out PCA we’re only interested in numeric variables, so let’s just plot those. We can use the piping operator %>% to do this without creating a new data frame
library(GGally)
penguins_nafree %>%
dplyr::select(species, where(is.numeric)) %>%
ggpairs(aes(color = species),
columns = c("flipper_length_mm", "body_mass_g",
"bill_length_mm", "bill_depth_mm"))
Using prcomp()
There are three basic types of information we obtain from Principal Component Analysis:
PC scores: the coordinates of our samples on the new PC axis: the new uncorrelated variables (stored in pca$x)
Eigenvalues: (see above) represent the variance explained by each PC; we can use these to calculate the proportion of variance in the original data that each axis explains
Variable loadings (eigenvectors): these reflect the weight that each variable has on a particular PC and can be thought of as the correlation between the PC and the original variable
Before we carry out PCA we should scale our data. WHY?
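One answer: without scaling, a variable with a large variance (e.g., body mass in grams versus bill depth in millimetres) dominates the components simply because of its units. A minimal sketch on simulated data:

```r
## two correlated variables on very different scales
set.seed(1)
small_unit <- rnorm(100, mean = 40, sd = 2)          ## e.g., mm
big_unit <- 50 * small_unit + rnorm(100, sd = 500)   ## e.g., g
dat <- cbind(small_unit, big_unit)
## unscaled: the high-variance variable dominates PC1 almost entirely
summary(prcomp(dat))
## scaled: both variables contribute on an equal footing
summary(prcomp(dat, scale. = TRUE))
```

In the unscaled analysis PC1 explains essentially all of the variance, telling us about units rather than structure; after scaling the split is far more balanced.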
pca <- penguins_nafree %>%
dplyr::select(where(is.numeric), -year) %>% ## year makes no sense here so we remove it and keep the other numeric variables
scale() %>% ## scale the variables
prcomp()
## print out a summary
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.6569 0.8821 0.60716 0.32846
## Proportion of Variance 0.6863 0.1945 0.09216 0.02697
## Cumulative Proportion 0.6863 0.8809 0.97303 1.00000
This output tells us that we obtain 4 principal components, called PC1, PC2, PC3, and PC4 (as expected, because we used the 4 original numeric variables!). Each of these PCs explains a percentage of the total variation (Proportion of Variance) in the dataset:
PC1 explains \(\sim\) 68% of the total variance, which means that over two-thirds of the information in the dataset (4 variables) can be encapsulated by just that one Principal Component.
PC2 explains \(\sim\) 19% of the variance.
PC3 explains \(\sim\) 9% of the variance.
PC4 explains \(\sim\) 3% of the variance.
From the Cumulative Proportion row we see that by knowing the position of a sample in relation to just PC1 and PC2 we can get a pretty accurate view of where it stands in relation to other samples, as PC1 and PC2 together explain 88% of the variance.
The loadings (relationship) between the initial variables and the principal components are stored in pca$rotation:
pca$rotation
## PC1 PC2 PC3 PC4
## bill_length_mm 0.4537532 -0.60019490 -0.6424951 0.1451695
## bill_depth_mm -0.3990472 -0.79616951 0.4258004 -0.1599044
## flipper_length_mm 0.5768250 -0.00578817 0.2360952 -0.7819837
## body_mass_g 0.5496747 -0.07646366 0.5917374 0.5846861
Here we can see that bill_length_mm has a strong positive relationship with PC1, whereas bill_depth_mm has a negative relationship. flipper_length_mm and body_mass_g also have a strong positive relationship with PC1.
Plotting this we get
The new variables (PCs) are stored in pca$x, lets plot some of them alongside the loadings using a biplot. For PC1 vs PC2:
library(factoextra) ## install this package first
fviz_pca_biplot(pca, geom = "point") +
geom_point (alpha = 0.2)
Now for PC2 vs PC3
fviz_pca_biplot(pca, axes = c(2,3),geom = "point") +
geom_point (alpha = 0.2)
But how many PCs (new variables) do we keep? The whole point of this exercise is to reduce the number of variables we need to explain the variation in our data. So how many of these new variables (PCs) do we keep?
To assess this we can use the information printed above alongside a screeplot:
fviz_screeplot(pca)
Principal components from the original variables
Recall that the principal components are a linear combination of the (standardised) variables. So for PC1
loadings1 <- pca$rotation[,1]
loadings1
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 0.4537532 -0.3990472 0.5768250 0.5496747
Therefore, the first principal component will be \(0.454\times Z1 -0.399 \times Z2 + 0.577 \times Z3 + 0.550 \times Z4\) where \(Z1\), \(Z2\), \(Z3\), and \(Z4\) are the scaled numerical variables from the penguins dataset (i.e., bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g). To compute this we use R:
scaled_vars <- penguins_nafree %>%
dplyr::select(where(is.numeric), -year) %>%
scale() %>%
as_tibble()
## By "Hand"
by_hand <- loadings1[1]*scaled_vars$"bill_length_mm" +
loadings1[2]*scaled_vars$"bill_depth_mm" +
loadings1[3]*scaled_vars$"flipper_length_mm" +
loadings1[4]*scaled_vars$"body_mass_g"
## From PCA
pc1 <- pca$x[,1]
plot(by_hand,pc1)
As a sanity check, what does PCA do with pure noise? Let's simulate 10 uncorrelated variables:
set.seed(1234) ## just for reproducibility
noise <- as_tibble(replicate(10, rnorm(200, mean = 50, sd = 10)))
noise
## # A tibble: 200 × 10
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37.9 54.9 37.7 53.1 39.8 37.9 36.0 29.9 47.5 44.7
## 2 52.8 57.0 50.4 56.1 36.1 53.0 73.8 34.4 37.8 45.0
## 3 60.8 51.9 45.8 33.1 49.5 34.6 58.7 69.6 49.3 38.5
## 4 26.5 57.0 41.0 57.8 68.1 56.4 34.6 48.2 63.6 51.3
## 5 54.3 53.1 54.2 50.1 49.0 57.0 61.3 63.6 45.4 54.7
## 6 55.1 57.6 51.5 48.2 57.8 30.9 60.4 66.5 50.7 48.1
## 7 44.3 68.4 64.6 61.1 39.0 59.4 55.9 51.2 57.8 33.8
## 8 44.5 61.1 38.8 64.8 47.8 47.8 54.0 50.0 34.1 46.0
## 9 44.4 50.3 44.8 38.5 55.7 43.3 51.4 52.0 36.3 54.1
## 10 41.1 38.9 49.3 60.1 46.5 54.5 56.2 56.8 40.7 42.8
## # ℹ 190 more rows
corrplot::corrplot(cor(noise), method = "ellipse",type = "upper")
pca <- noise %>%
scale() %>%
prcomp()
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.1890 1.1331 1.0936 1.0281 1.0075 0.96654 0.94742
## Proportion of Variance 0.1414 0.1284 0.1196 0.1057 0.1015 0.09342 0.08976
## Cumulative Proportion 0.1414 0.2698 0.3894 0.4951 0.5966 0.68998 0.77974
## PC8 PC9 PC10
## Standard deviation 0.90017 0.85665 0.81141
## Proportion of Variance 0.08103 0.07339 0.06584
## Cumulative Proportion 0.86078 0.93416 1.00000
fviz_screeplot(pca, choice = "eigenvalue") +
geom_hline(yintercept = 1)
What does this show?
Multidimensional scaling (MDS) is an extended family of techniques that try to reproduce the relative positions of a set of points in a reduced space given, not the points themselves, but only a matrix of interpoint distances (dissimilarities). These distances might be measured with error, or even be non-Euclidean.
Metric scaling tries to produce a set of coordinates (a configuration of points) in a reduced number of dimensions whose matrix of interpoint Euclidean distances approximates the original dissimilarity matrix as closely as possible. Principal coordinates (PCO) is one metric scaling technique (it is sometimes called classical or Torgerson scaling).
Consider data which are not represented as points in a feature space:
library(tidyverse)
ni <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/north_islands_distances.csv")
ni <- ni %>% column_to_rownames(var = "...1")
library(pheatmap)
pheatmap(ni, cluster_rows = TRUE,
treeheight_row = 2, treeheight_col = 2,
fontsize_row = 12, fontsize_col = 12,
cellwidth = 26, cellheight = 26)
mds <- cmdscale(ni, eig = TRUE)
mds
## $points
## [,1] [,2]
## Auckland 259.23245 67.43013
## Gisborne -107.54173 -285.70950
## Hamilton 129.07943 42.71295
## Hastings -173.12950 -25.15974
## Napier -150.83765 -34.70680
## Rotorua 37.39858 -18.39760
## Tauranga 118.78535 -85.88683
## Whanganui -192.73988 181.50600
## Wellington -385.83172 167.76477
## Whakatane 49.93256 -140.17112
## Whangarei 415.65212 130.61774
##
## $eig
## [1] 5.249373e+05 1.953521e+05 4.217767e+04 1.872276e+04 1.222717e+03
## [6] 2.910383e-11 -1.399691e+02 -4.733140e+02 -1.103819e+04 -1.883151e+04
## [11] -2.462990e+04
##
## $x
## NULL
##
## $ac
## [1] 0
##
## $GOF
## [1] 0.8600209 0.9206005
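The two $GOF values report how much of the eigenvalue "mass" the retained dimensions capture, and can be reproduced by hand from $eig. A sketch using the built-in eurodist road distances (the same calculation applies to the ni matrix above):

```r
## reproducing cmdscale's two goodness-of-fit values from the eigenvalues
mds_euro <- cmdscale(eurodist, k = 2, eig = TRUE)
lam <- mds_euro$eig

g1 <- sum(lam[1:2]) / sum(abs(lam))      ## denominator: sum of |eigenvalues|
g2 <- sum(lam[1:2]) / sum(pmax(lam, 0))  ## denominator: positive eigenvalues only
c(g1, g2)                                ## should match mds_euro$GOF
```

Negative eigenvalues arise because road distances are not exactly Euclidean; the second GOF value simply drops them from the denominator.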
Data may come from objects for which we have similarities but no underlying (geometric) space. Here the goal is to understand the underlying dimensionality of colour perception.
library(tidyverse)
ekman <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/ekman.csv")
ekman.mds <- cmdscale(ekman, eig = TRUE)
ekman.mds
round(ekman.mds$eig,2)
library(ggfortify) ## needed for autoplot()
autoplot(ekman.mds)
library(ggfortify)
## Plotting Multidimensional Scaling (for interest)
## stats::cmdscale performs Classical MDS
data("eurodist") ## road distances (in km) between 21 cities in Europe.
autoplot(eurodist)
## Plotting Classical (Metric) Multidimensional Scaling
autoplot(cmdscale(eurodist, eig = TRUE))
autoplot(cmdscale(eurodist, eig = TRUE), label = TRUE, shape = FALSE,
label.size = 3)
## Plotting Non-metric Multidimensional Scaling
## MASS::isoMDS and MASS::sammon perform Non-metric MDS
library(MASS)
autoplot(sammon(eurodist))
autoplot(sammon(eurodist), shape = FALSE, label = TRUE,label.size = 3)
## Have a go at interpreting these plots based on the geography of the cities :-)
CA is a special case of metric MDS where the distance measure is the chi-square distance. It is conceptually similar to principal component analysis, but for categorical data (counts) rather than continuous data. CA is traditionally applied to contingency tables where rows and columns are treated equivalently; it decomposes the chi-square statistic associated with the table into orthogonal factors. Correspondence analysis is usually the best way to follow up on a significant chi-square test.
HairEyeColor
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
HC.df <- as.data.frame.matrix(HairEyeColor[ , , 2])
HC.df
## Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
chisq.test(HC.df)
##
## Pearson's Chi-squared test
##
## data: HC.df
## X-squared = 106.66, df = 9, p-value < 2.2e-16
library(ade4)
coaHC <- dudi.coa(HC.df, scannf = FALSE, nf = 2)
The first axis shows a contrast between black-haired and blond-haired students, mirrored by the brown eye, blue eye contrast. In CA the two sets of categories, rows and columns, play symmetric roles, and we interpret the proximity of Blue eyes and Blond hair as showing strong co-occurrence of these categories.
Biplot barycentric scaling
Multidimensional scaling aims to minimize the difference between the squared distances \(D^2\) from the distance matrix \(D\), and the squared distances between the points with their new coordinates. Unfortunately, this objective tends to be sensitive to outliers: one single data point with large distances to everyone else can dominate, and thus skew, the whole analysis.
Under certain circumstances trying to preserve the actual dissimilarities might be too restrictive or even pointless. For example, if there is large error in the dissimilarity estimates, or if the dissimilarities (or the data they were based on) were ranks (ordinal), then the magnitudes of the distances are too crude to be worth preserving. A method that preserves only the rank order of the dissimilarities would be more appropriate. The algorithm to do this is virtually the same as the one given above for metric scaling. The sole difference is that the linear regression that fitted the estimated distances to the dissimilarities is now replaced with an order-preserving regression.
Robust ordination, or non-metric multidimensional scaling (NMDS), attempts to embed the points in a new space such that the order of the reconstructed distances in the new map is the same as the ordering of the original distance matrix. NMDS looks for a transformation \(f()\) of the given dissimilarities (distances \(d\)). The quality of the approximation can be measured by the standardized residual sum of squares (STRESS) function: \(\text{Stress}^2 = \frac{\Sigma(f(d) - \tilde{d})^2}{\Sigma d^2}\) where \(f(d)\approx \tilde{d}\).
NMDS is not sequential:
Use the function metaMDS from the vegan package; metaMDS performs NMDS, and tries to find a stable solution using several random starts. In addition, it standardizes the scaling in the result, so that the configurations are easier to interpret.
Illustration with k = 2
library(vegan)
nMDS.2 <- replicate(100, metaMDS(ekman, k = 2, autotransform = FALSE))
stressplot(metaMDS(ekman, k = 2, autotransform = FALSE), pch = 20, cex = 2)
## Run 0 stress 0.2705898
## Run 1 stress 0.2711545
## Run 2 stress 0.2739054
## Run 3 stress 0.2634663
## ... New best solution
## ... Procrustes: rmse 0.2448256 max resid 0.4517853
## Run 4 stress 0.265124
## Run 5 stress 0.261966
## ... New best solution
## ... Procrustes: rmse 0.1819642 max resid 0.5616079
## Run 6 stress 0.2752863
## Run 7 stress 0.2634591
## Run 8 stress 0.2664506
## Run 9 stress 0.2658195
## Run 10 stress 0.2655221
## Run 11 stress 0.2683486
## Run 12 stress 0.2682856
## Run 13 stress 0.272819
## Run 14 stress 0.2611536
## ... New best solution
## ... Procrustes: rmse 0.1314703 max resid 0.2913875
## Run 15 stress 0.278313
## Run 16 stress 0.3000136
## Run 17 stress 0.2702805
## Run 18 stress 0.2770341
## Run 19 stress 0.2713099
## Run 20 stress 0.2635924
## *** Best solution was not repeated -- monoMDS stopping criteria:
## 1: no. of iterations >= maxit
## 18: stress ratio > sratmax
## 1: scale factor of the gradient < sfgrmin
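vegan is not the only option: the MASS package (shipped with R) provides isoMDS for Kruskal's non-metric MDS. A minimal sketch using the built-in eurodist distances:

```r
## non-metric MDS with MASS::isoMDS, starting from the classical solution
library(MASS)

nmds_euro <- isoMDS(eurodist, k = 2, trace = FALSE)
dim(nmds_euro$points)   ## 21 cities, 2 dimensions
nmds_euro$stress        ## Kruskal's stress (as a percentage); lower is better
```

Note that isoMDS uses a single start (by default, the classical MDS configuration), whereas metaMDS tries several random starts to avoid local minima.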
LDA is a supervised learning technique: the main goal is to predict some feature of interest using one or more variables (the predictors).
library(tidyverse)
diabetes <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/diabetes.csv")
The data
diabetes$group <- factor(diabetes$group)
diabetes
## # A tibble: 144 × 7
## id relwt glufast glutest steady insulin group
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 1 0.81 80 356 124 55 3
## 2 3 0.94 105 319 143 105 3
## 3 5 1 90 323 240 143 3
## 4 7 0.91 100 350 221 119 3
## 5 9 0.99 97 379 142 98 3
## 6 11 0.9 91 353 221 53 3
## 7 13 0.96 78 290 136 142 3
## 8 15 0.74 86 312 208 68 3
## 9 17 1.1 90 364 152 76 3
## 10 19 0.83 85 296 116 60 3
## # ℹ 134 more rows
Some variables can predict group of a patient
ggplot(reshape2::melt(diabetes, id.vars = c("id", "group")),
aes(x = value, col = group)) +
geom_density() + facet_wrap( ~variable, ncol = 1, scales = "free") +
theme(legend.position = "bottom")
Possible classification rules?
ggplot(diabetes, mapping = aes(x = insulin, y = glutest)) +
theme_bw() +
geom_point(aes(colour = group), size = 3) +
labs( x = "insulin" , y = "glutest") +
theme(axis.title = element_text( size = 16),
axis.text = element_text(size = 12))
Some similarity to regression
library(MASS)
diabetes_lda <- lda(group ~ insulin + glutest, data = diabetes)
diabetes_lda
## Call:
## lda(group ~ insulin + glutest, data = diabetes)
##
## Prior probabilities of groups:
## 1 2 3
## 0.2222222 0.2500000 0.5277778
##
## Group means:
## insulin glutest
## 1 320.9375 1027.3750
## 2 208.9722 493.9444
## 3 114.0000 349.9737
##
## Coefficients of linear discriminants:
## LD1 LD2
## insulin -0.004463900 -0.01591192
## glutest -0.005784238 0.00480830
##
## Proportion of trace:
## LD1 LD2
## 0.9677 0.0323
Components of diabetes_lda
diabetes_lda$prior gives the prior probabilities of belonging to each group. By default these reflect the proportions of membership in the data:
prop.table(table(diabetes$group))
##
## 1 2 3
## 0.2222222 0.2500000 0.5277778
\(\rightarrow\) a randomly chosen subject has probability 0.53 of coming from group 3.
diabetes_lda$means gives the means of each predictor in each group. Proportion of trace gives the percentage separation achieved by each discriminant function.
diabetes_lda$scaling contains the linear discriminant functions (i.e., the linear combination of variables giving best separation between groups):
diabetes_lda$scaling
## LD1 LD2
## insulin -0.004463900 -0.01591192
## glutest -0.005784238 0.00480830
i.e.,
LD1: \(-0.00446 \times \text{insulin} - 0.00578 \times \text{glutest}\)
LD2: \(-0.01591 \times \text{insulin} + 0.00481 \times \text{glutest}\)
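As with the principal components earlier, the LD scores can be computed by hand: centre each predictor at the prior-weighted average of the group means, then apply the coefficients in $scaling. A sketch using the built-in iris data (the diabetes fit works the same way); the centring step here is an assumption about how MASS's predict.lda behaves internally:

```r
library(MASS)

iris_lda <- lda(Species ~ Sepal.Length + Petal.Length, data = iris)

## centre the predictors at the prior-weighted average of the group means
## (assumed to match predict.lda's internal centring), then take the
## linear combinations given by the discriminant coefficients
X <- as.matrix(iris[, c("Sepal.Length", "Petal.Length")])
ctr <- colSums(iris_lda$prior * iris_lda$means)
by_hand <- scale(X, center = ctr, scale = FALSE) %*% iris_lda$scaling

all.equal(unname(by_hand), unname(predict(iris_lda)$x))
</code>
```

This mirrors the "by hand" check we did for PC1: the LD scores are nothing more mysterious than linear combinations of (centred) predictors.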
How well does LDA do on training data?
ghat <- predict(diabetes_lda)$class
table(predicted = ghat, observed = diabetes$group)
## observed
## predicted 1 2 3
## 1 25 0 0
## 2 6 24 6
## 3 1 12 70
The misclassification rate is therefore
mean(ghat != diabetes$group)
## [1] 0.1736111
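This rate is computed on the training data, so it is optimistic. lda() has a CV = TRUE argument that instead returns leave-one-out cross-validated class predictions. A sketch with the built-in iris data (the diabetes file needs a download, so it is not used here):

```r
## leave-one-out cross-validated misclassification rate via lda(CV = TRUE)
library(MASS)

iris_cv <- lda(Species ~ ., data = iris, CV = TRUE)

## cross-validated confusion matrix and error rate
table(predicted = iris_cv$class, observed = iris$Species)
mean(iris_cv$class != iris$Species)
```

With CV = TRUE each observation is classified by a model fitted without it, giving a more honest estimate of out-of-sample performance.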
diabetes.pred <- predict(diabetes_lda)
str(diabetes.pred)
## List of 3
## $ class : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
## $ posterior: num [1:144, 1:3] 0.00000104 0.00000106 0.00000247 0.00000326 0.00000477 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:144] "1" "2" "3" "4" ...
## .. ..$ : chr [1:3] "1" "2" "3"
## $ x : num [1:144, 1:2] 1.62 1.61 1.42 1.37 1.29 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:144] "1" "2" "3" "4" ...
## .. ..$ : chr [1:2] "LD1" "LD2"
- $class: predicted group for each observation
- $posterior: probability of falling into each group
- $x: matrix with 2 columns, one for each LD score
The three ellipses represent the class centres and the covariance matrix of the LDA model. Note there is only one covariance matrix, which is the same for all three classes. This results in linear (straight-line) boundaries between the predicted classes.
A key assumption of LDA is that the correlations between variables are the same in each group (i.e., common covariance matrix).
Recall that, by default, the prior probabilities are the initial proportions. What if we set equal prior probabilities?
The confusion matrix and misclassification rate:
equal.ghat <- predict(diabetes_lda, prior = rep(1,3)/3)$class
table(predicted = equal.ghat,observed = diabetes$group)
## observed
## predicted 1 2 3
## 1 25 0 0
## 2 7 28 9
## 3 0 8 67
## misclassification rate
mean(equal.ghat != diabetes$group)
## [1] 0.1666667
There are now 8 cases that were classified as Group 3 under the default (proportional) priors but are classified as Group 2 under equal priors \(\rightarrow\) the default priors bias classification towards the group with the larger initial size.
Learning objectives
Use R for the following estimation techniques
Other resources
This section is a recap only; if you need a more in-depth overview of matrix algebra then use the extra materials provided at the start of this module.
Matrices are commonly written in box brackets or parentheses and are typically denoted by upper case bold letters (e.g., A) with elements represented by the corresponding lower case indexed letters:
\[\mathbf {A} = \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix}\]
The entry in the \(i^{th}\) row and \(j^{th}\) column of a matrix A is often referred to as the \((i,j)^{th}\) entry of the matrix, and most commonly denoted as \(a_{i,j}\), or \(a_{ij}\).
Matrix addition
The sum A + B of two m-by-n matrices A and B: \[\begin{array}{rl} \mathbf {A} + \mathbf {B} &= \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix} + \begin{bmatrix}b_{11}&b_{12}&\cdots &b_{1n}\\b_{21}&b_{22}&\cdots &b_{2n}\\\vdots &\vdots &\ddots &\vdots \\b_{m1}&b_{m2}&\cdots &b_{mn}\end{bmatrix} \\ &= \begin{bmatrix}a_{11} +b_{11}&a_{12}+b_{12}&\cdots &a_{1n}+b_{1n}\\a_{21}+b_{21}&a_{22}+b_{22}&\cdots &a_{2n}+b_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}+b_{m1}&a_{m2}+b_{m2}&\cdots &a_{mn}+b_{mn}\end{bmatrix}\end{array}\]
Scalar multiplication
The product cA of a number c and a matrix A:
\[c\times \mathbf {A} = \begin{bmatrix}c\times a_{11}&c\times a_{12}&\cdots &c\times a_{1n}\\c\times a_{21}&c\times a_{22}&\cdots &c\times a_{2n}\\\vdots &\vdots &\ddots &\vdots \\c\times a_{m1}&c\times a_{m2}&\cdots &c\times a_{mn}\end{bmatrix}\]
Matrix transposition
The transpose of an m-by-n matrix A is the n-by-m matrix A\(^T\):
\[\mathbf {A}^\text{T} = \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1n}\\a_{21}&a_{22}&\cdots &a_{2n}\\\vdots &\vdots &\ddots &\vdots \\a_{m1}&a_{m2}&\cdots &a_{mn}\end{bmatrix}^\text{T} = \begin{bmatrix}a_{11}&a_{21}&\cdots &a_{m1}\\a_{12}&a_{22}&\cdots &a_{m2}\\\vdots &\vdots &\ddots &\vdots \\a_{1n}&a_{2n}&\cdots &a_{mn}\end{bmatrix}\]
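These three operations map directly onto R (a quick sketch with small made-up matrices):

```r
## matrix addition, scalar multiplication, and transposition in R
A <- matrix(1:6, nrow = 2)   ## a 2-by-3 matrix, filled column-wise
B <- matrix(6:1, nrow = 2)   ## another 2-by-3 matrix

A + B    ## elementwise sum (dimensions must match): here every entry is 7
3 * A    ## scalar multiplication: every entry of A multiplied by 3
t(A)     ## the 3-by-2 transpose
```

Note that matrix() fills by column unless byrow = TRUE is given.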
Matrix multiplication
Multiplication of two matrices is defined if and only if (iff) the number of columns of the left matrix is the same as the number of rows of the right matrix. If A is an m-by-n matrix and B is an n-by-p matrix, then their matrix product AB is the m-by-p matrix whose entries are given by the dot product of the corresponding row of A and the corresponding column of B:
\[[\mathbf {AB} ]_{i,j}=a_{i,1}b_{1,j}+a_{i,2}b_{2,j}+\cdots +a_{i,n}b_{n,j}=\sum _{r=1}^{n}a_{i,r}b_{r,j}\]
Note Matrix multiplication satisfies the rules (AB)C = A(BC) (associativity), and (A + B)C = AC + BC as well as C(A + B) = CA + CB (left and right distributivity), whenever the size of the matrices is such that the various products are defined. The product AB may be defined without BA being defined, namely if A and B are m-by-n and n-by-k matrices, respectively, and m \(\neq\) k. Even if both products are defined, they generally need not be equal, that is: AB \(\neq\) BA.
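In R, matrix multiplication is the %*% operator (plain * is elementwise). A quick check of the dimension rule, non-commutativity, and associativity:

```r
## matrix multiplication with %*%
A <- matrix(1:6, nrow = 2)   ## 2-by-3
B <- matrix(1:6, nrow = 3)   ## 3-by-2

A %*% B   ## defined: (2x3)(3x2) gives a 2-by-2 matrix
B %*% A   ## also defined here, but gives a 3-by-3 matrix, so AB != BA

C <- matrix(1:4, nrow = 2)   ## 2-by-2
all.equal((A %*% B) %*% C, A %*% (B %*% C))   ## associativity holds
```

Trying A %*% A here would throw a "non-conformable arguments" error, since a 2-by-3 matrix cannot be multiplied by itself.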
Recall the linear regression (with a simple explanatory variable) equation from Module 2:
\[Y_i = \alpha + \beta_1x_i + \epsilon_i\] where
\[\epsilon_i \sim \text{Normal}(0,\sigma^2).\]
Here for observation \(i\)
Recall, that the Euclidean distance between two points is calculated as
\[\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}\]
The method of least squares estimation works by minimising the distances between each observation and the line of fit (i.e., the residuals). The line-of best fit is the line with the smallest residual sum (i.e., all the possible red lines in the figures)!
Let’s consider the CRD outlined in the previous module, we can write the effects model using matrix representation:
\[\boldsymbol{y} = \boldsymbol{X\beta} + \boldsymbol{\epsilon}\]
where
\(\boldsymbol{y} = \begin{bmatrix} y_{11} \\ y_{12} \\ y_{13} \\ y_{14} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{24} \\ y_{31} \\ y_{32} \\ y_{33} \\ y_{34} \end{bmatrix}\), \(\boldsymbol{X} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix}\), \(\boldsymbol{\beta} = \begin{bmatrix} \alpha \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{bmatrix}\), and \(\boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_{11} \\ \epsilon_{12} \\ \epsilon_{13} \\ \epsilon_{14} \\ \epsilon_{21} \\ \epsilon_{22} \\ \epsilon_{23} \\ \epsilon_{24} \\ \epsilon_{31} \\ \epsilon_{32} \\ \epsilon_{33} \\ \epsilon_{34} \end{bmatrix}.\)
where \(\boldsymbol{\epsilon} \sim \text{MVN}(\boldsymbol{0}, \boldsymbol{\sigma^2 I})\). The least squares estimators of \(\boldsymbol{\beta}\) are the solutions to the normal equations \[\boldsymbol{X^{'}X\beta}=\boldsymbol{X^{'}y}.\]
Recall that for a factor variable one level (the first, by default) is taken as the baseline, and the coefficients we estimate are the differences in means relative to it. This ensures that the matrix \(\boldsymbol{X}\) is of full rank. So
\[\boldsymbol{X} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix}\]
Now, with three factor levels the least squares estimators of \(\boldsymbol{\beta}\) are (note the hat to denote the estimates)
\[(\boldsymbol{X^{'}X})^{-1}\boldsymbol{X^{'}y} = \boldsymbol{\hat{\beta}}= \begin{bmatrix} \hat{\alpha} + \hat{\tau_1} \\ \hat{\tau_2} - \hat{\tau_1} \\ \hat{\tau_3} - \hat{\tau_1} \end{bmatrix} \]
Using the coffee data from the previous chapter we have
| Beans | Strength |
|---|---|
| Type 1 | 5.3 |
| Type 1 | 6.0 |
| Type 1 | 6.6 |
| Type 1 | 4.9 |
| Type 2 | 7.5 |
| Type 2 | 7.1 |
| Type 2 | 6.3 |
| Type 2 | 7.6 |
| Type 3 | 8.8 |
| Type 3 | 6.9 |
| Type 3 | 9.2 |
| Type 3 | 10.3 |
so
\(\boldsymbol{X^{'}X} = \begin{bmatrix}12&4&4 \\4&4&0 \\4&0&4 \\\end{bmatrix}\), \(\boldsymbol{X^{'}y} = \begin{bmatrix}86.5 \\28.5 \\35.2 \\\end{bmatrix}\), and \((\boldsymbol{X^{'}X})^{-1} = \begin{bmatrix}0.25&-0.25&-0.25 \\-0.25&0.5&0.25 \\-0.25&0.25&0.5 \\\end{bmatrix}.\)
Therefore,
\[\boldsymbol{\hat{\beta}} = (\boldsymbol{X^{'}X})^{-1}\boldsymbol{X^{'}y} = \begin{bmatrix} \hat{\alpha} + \hat{\tau_1} \\ \hat{\tau_2} - \hat{\tau_1} \\ \hat{\tau_3} - \hat{\tau_1} \end{bmatrix} = \begin{bmatrix}0.25&-0.25&-0.25 \\-0.25&0.5&0.25 \\-0.25&0.25&0.5 \\\end{bmatrix} \times \begin{bmatrix}86.5 \\28.5 \\35.2 \\\end{bmatrix} = \begin{bmatrix}5.7 \\1.425 \\3.1 \\\end{bmatrix}\]
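The hand calculation can be checked in R by building the design matrix directly (the Strength values come from the coffee table above):

```r
## solving the normal equations for the coffee data "by hand" in R
y <- c(5.3, 6.0, 6.6, 4.9,    ## Type 1
       7.5, 7.1, 6.3, 7.6,    ## Type 2
       8.8, 6.9, 9.2, 10.3)   ## Type 3
X <- cbind(1,
           rep(c(0, 1, 0), each = 4),   ## indicator for Type 2
           rep(c(0, 0, 1), each = 4))   ## indicator for Type 3

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
drop(beta_hat)   ## 5.700 1.425 3.100
```

These match the estimates computed by hand above, and the lm() coefficients in the next section.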
lm in R
Linear least squares estimation can be carried out in R by simply using the function lm():
mod <- lm(Strength ~ Beans, data = df)
summary(mod)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.700 0.4934994 11.550166 0.000001065454
## BeansType 2 1.425 0.6979136 2.041800 0.071556522658
## BeansType 3 3.100 0.6979136 4.441811 0.001619249110
Under the assumptions of a linear model, maximum likelihood estimation is equivalent to least squares. However (as we'll see in a later module), we often need to be more flexible!
This section is a recap only; if you need a more in-depth overview of differentiation then use the extra materials provided at the start of this module.
The constant factor rule
\[(af(x))'=af'(x)\] The sum rule
\[(f(x)+g(x))'=f'(x)+g'(x)\] The subtraction rule
\[(f(x) - g(x))'=f'(x) - g'(x)\]
The product rule
For the functions \(f\) and \(g\), the derivative of the function \(h(x) = f(x)g(x)\) with respect to \(x\) is
\[h'(x) = (fg)'(x) = f'(x)g(x) + f(x)g'(x)\]
The chain rule
The derivative of the function \(h(x)=f(g(x))\) is
\[h'(x)=f'(g(x))\cdot g'(x)\]
In summary, for any functions \(f\) and \(g\) and any real numbers \(a\) and \(b\), the derivative of the function \(h(x) = a f(x)+bg(x)\) with respect to \(x\) is \(h'(x)=af'(x) + bg'(x).\)
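R can apply these rules symbolically with the base function D(); a quick sketch of the product and chain rules:

```r
## symbolic differentiation with D(): product rule and chain rule
D(expression(x^2 * sin(x)), "x")   ## product rule: 2 * x * sin(x) + x^2 * cos(x)
D(expression(sin(x^2)), "x")       ## chain rule: cos(x^2) * (2 * x)

## the returned expressions can be evaluated at any x
eval(D(expression(x^2 * sin(x)), "x"), list(x = 1))   ## 2*sin(1) + cos(1)
```

This is handy for checking differentiation by hand, as we do with the likelihoods below.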
This section is a recap only; if you need a more in-depth overview of log rules then use the extra materials provided at the start of this module.
The process of using observations to suggest a value for a parameter is called estimation. The value suggested is called the estimate of the parameter.
Let us consider data from the published article Influence of predator identity on the strength of predator avoidance responses in lobsters.
The authors were interested in how a juvenile lobster’s size was related to its vulnerability to predation. In total, 159 juvenile lobsters were collected from their natural habitat in the Gulf of Maine, USA, and the length of each lobster’s carapace (upper shell) was measured to the nearest 3 mm. The lobsters were then tethered to the ocean floor for 24 hours. Any missing lobsters were assumed to have been consumed by a predator, while the surviving lobsters were released.
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/lobster.csv")
data %>%
mutate(survived = ifelse(survived == 0, "consumed", "alive")) %>%
group_by(survived) %>%
tally() %>%
pivot_wider(names_from = c(survived), values_from = n)
## # A tibble: 1 × 2
## alive consumed
## <int> <int>
## 1 79 80
So, 79 of the small juvenile lobsters survived predation from a total of 159. We are interested in the probability of survival, \(\theta\), for the general population. The obvious estimate is simply to take the ratio, \(\frac{\text{number surviving juveniles}}{\text{total number juveniles}} = \frac{79}{159} = 0.4968553\)!!
There are, however, many situations where our common sense fails us. For example, what would we do if we had a regression-model situation as in Module 2 and would like to specify an alternative form for \(\theta\), such as
\[\theta = \alpha + \beta \times (\text{lobster species}).\]
How would we estimate the unknown intercept \(\alpha\) and slope(s) \(\beta\), assuming we had information on lobster species etc.? For this we need a general framework for estimation that can be applied to any situation. The most useful and general method of obtaining parameter estimates is the method of maximum likelihood estimation. I mean, the title of this section was a bit of a giveaway!
The likelihood function, \(L(\theta ; x)\) is a function of the parameter(s) \(\theta\) for fixed data \(x\) and it gives the probability of a fixed observation \(x\) for every possible value of the parameter(s) \(\theta\), \(P(X = x)\).
From above, letting \(S\) be the number of lobsters alive after the 24 hours we can assume a Binomial distribution:
\[L(\theta ; s) = P(S = s) = {n \choose s} \theta^s (1 - \theta)^{n-s}.\]
To obtain the maximum likelihood estimate (i.e., the best guess for \(\theta\) given the observed data) we need to differentiate the likelihood. That is, find its rate of change given data \(x\), \(\frac{\delta L}{\delta \theta}\). We are interested in the best guess for \(\theta\); this occurs at the point where the rate of change of the likelihood is zero (i.e., \(\frac{\delta L}{\delta \theta} = 0\)).
Using the product rule we differentiate \(L(\theta ; s)\):
\[ \begin{array}{cl} \frac{\delta L}{\delta \theta} &= {n \choose s}\left( s\theta^{s-1} (1 - \theta)^{n-s} + \theta^s (n-s) (1 - \theta)^{n-s-1}(-1) \right)\\ &= {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s(1-\theta) - (n-s)\theta\} \\ &= {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s - s\theta - n\theta + s\theta \}\\ &= {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s - n\theta \}. \end{array}\]
Setting \(\frac{\delta L}{\delta \theta} = 0\) and solving for \(\theta\):
\[\frac{\delta L}{\delta \theta} = {n \choose s}(1 - \theta)^{n-s-1}\theta^{s-1}\{s - n\theta \} = 0.\] Since the constant \({n \choose s}\) is never zero, there are, technically, three possible solutions to this:
Now, remember \(\theta\) is a probability and we are after a maximum likelihood estimate based on the data, so based on the above our best guess for \(\theta\) is \[\hat{\theta} = \frac{s}{n}.\] Using this for the lobster data we get \(\hat{\theta} = \frac{79}{159}.\) Surprise surprise!!
Using R
## define the likelihood as a function of the parameter(s)
## Luckily the Binomial likelihood is already defined in R
## as dbinom()
likelihood <- function(theta) dbinom(x = 79, size = 159, prob = theta)
## Use the optimise function to optimise!!
## the second argument specifies the plausible
## interval for the parameter
## Note for a number of parameters > 1
## the optim() function is used
optimise(likelihood, c(0,1), maximum = TRUE)
## $maximum
## [1] 0.4968541
##
## $objective
## [1] 0.06317819
Recall the following CRD equation
\[Y_{ik} = \mu_i + \epsilon_{ik}\]
where \(Y_{ik}\) is the response for the \(k^{th}\) experimental unit (\(k = 1, ..., r_i\), where \(r_i\) is the number of experimental replications in the \(i^{th}\) level of the treatment factor) subjected to the \(i^{th}\) level of the treatment factor (\(i = 1, ..., t\)). Here the \(\mu_i\) are the different (cell) means for each level of the treatment factor.
Under the assumptions of the CRD (i.e., \(\epsilon_{ik} \sim N(0, \sigma^2)\)) then (for an equal number of replicates) the estimates of the cell means (\(\mu_i\)) are found by minimising the error sum of squares \[SS_{\epsilon} = \Sigma_{i=1}^t \Sigma_{k=1}^{r_i}(y_{ik}-\mu_i)^2.\] Taking the partial derivative of \(SS_{\epsilon}\) with respect to each cell mean \(\mu_i\), setting it to zero, and solving each equation will give us our estimates: \[\frac{\delta SS_{\epsilon}}{\delta \mu_i} = -2 \Sigma_{k=1}^{r_i}(y_{ik}-\mu_i) = 0.\] This works out as \(\hat{\mu_i} = \overline{y_{i.}}\)
So in our coffee example \(\hat{\mu}_{\text{Type 1 }} = 5.7 ,\; \hat{\mu}_{\text{Type 2 }} = 7.125 \; , \&\; \hat{\mu}_{\text{Type 3 }} = 8.8 .\) Compare these estimates to those we obtained via least squares estimation previously.
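We can verify these cell means in R (rebuilding the coffee data frame here, since df was created in the previous chapter):

```r
## cell means for the coffee data: the least squares estimates mu-hat_i
df <- data.frame(
  Beans = rep(c("Type 1", "Type 2", "Type 3"), each = 4),
  Strength = c(5.3, 6.0, 6.6, 4.9,
               7.5, 7.1, 6.3, 7.6,
               8.8, 6.9, 9.2, 10.3)
)
tapply(df$Strength, df$Beans, mean)   ## 5.700 7.125 8.800
```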
In many situations differentiating the likelihood is tricky. Even the simple binomial example above was a bit finicky. So we often deal with the log-likelihood (the log of the likelihood). Why?
Let’s consider the binomial example above. We had \(L(\theta ; s) = {n \choose s} \theta^s (1 - \theta)^{n-s}.\)
Therefore, \[\begin{array}{cl} \text{log}(L(\theta ; s)) &= \text{log}({n \choose s}) + \text{log}(\theta^s) + \text{log}((1 - \theta)^{n-s})\\ &= \text{log}({n \choose s}) + s\text{log}(\theta) + (n-s)\text{log}(1 - \theta). \end{array}\]
Differentiating this: \[\begin{array}{cl} \frac{\delta \text{log}(L(\theta ; s))}{\delta \theta} &= 0 + \frac{s}{\theta} \times 1 + \frac{n-s}{1-\theta}\times (-1)\\ &= \frac{s}{\theta} - \frac{n-s}{1-\theta} \\ \end{array}\]
Setting this to zero we get \(\frac{s}{\theta} = \frac{n-s}{1-\theta} \rightarrow s(1-\theta) = \theta(n-s) \rightarrow s - s\theta = \theta n - s\theta \rightarrow s + (s\theta - s\theta) = \theta n \rightarrow \theta = \frac{s}{n}.\) Therefore, as above \[\hat{\theta} = \frac{s}{n}.\]
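Numerically, maximising the log-likelihood gives the same answer as maximising the likelihood itself (compare with the optimise() call on the likelihood earlier):

```r
## maximising the Binomial log-likelihood for the lobster data
## dbinom(..., log = TRUE) returns the log of the probability
loglik <- function(theta) dbinom(x = 79, size = 159, prob = theta, log = TRUE)

optimise(loglik, c(0, 1), maximum = TRUE)$maximum   ## ~ 79/159
```

Because log is a monotone increasing function, the \(\theta\) that maximises the log-likelihood is exactly the \(\theta\) that maximises the likelihood.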
The Poisson process counts the number of events occurring in a fixed time or space, when events occur independently and at a constant average rate, \(\lambda\).
For \(X \sim \text{Poisson}(\lambda)\),
\[f_X(x) = P(X=x)=\frac{\lambda^x}{x!}e^{-\lambda}\]
for \(x = 0,1,2,\dots\)
Suppose that \(x_1, \ldots, x_n\) are iid observations from a Poisson distribution with unknown parameter \(\lambda\):
\[ L(\lambda\,; x_1, \ldots, x_n) = K e^{-n\lambda} \, \lambda^{n \overline{x}} \,,\]
where \(\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i\), and \(K=\prod_{i=1}^n \frac1{x_i\,!}\) is a constant that doesn’t depend on \(\lambda\).
We differentiate \(L(\lambda\,; x_1, \ldots, x_n)\) and set to 0 to find the MLE:
\[\begin{array}{rl} 0 &= \frac{\delta}{\delta\lambda} L(\lambda\,; x_1, \ldots, x_n) \\ &= K \left( -n e^{-n\lambda} \, \lambda^{n \overline{x}} + n\overline{x} e^{-n\lambda} \, \lambda^{n \overline{x} - 1} \right)\\ &= K e^{-n\lambda} \lambda^{n \overline{x} - 1} \left(- n\lambda + n\overline{x} \right) \end{array}\]
\(\rightarrow \lambda=\infty, \lambda=0, \text{or } \lambda = \overline{x}.\)
If we know that \(L(\lambda\,; x_1, \ldots, x_n)\) reaches a unique maximum in \(0 < \lambda < \infty\) then the MLE is \(\overline{x}\).
So the maximum likelihood estimator is \[ \hat{\lambda} = \overline{x} = \frac{x_1 + \ldots + x_n}{n} \;.\]
As above,
\[L(\lambda\,; x_1, \ldots, x_n) = \prod_{i=1}^n \frac{\lambda^{x_i}}{x_i\,!} e^{-\lambda}.\] Therefore,
\[\begin{array}{rl}\text{log}(L(\lambda\,; x_1, \ldots, x_n)) &= \sum_{i=1}^n \text{log}(\frac{\lambda^{x_i}}{x_i\,!} e^{-\lambda})\\ &= \sum_{i=1}^n \text{log}(\frac{1}{x_i\,!}) + \text{log}(\lambda^{x_i}) + \text{log}(e^{-\lambda}) \\ &= \sum_{i=1}^n \text{log}(\frac{1}{x_i\,!}) + x_i \text{log}(\lambda) + (-\lambda) \\ &= K' + \text{log}(\lambda) \sum_{i=1}^n x_i - n \lambda \quad \mbox{ where $K'$ is a constant} \\ &= K' + \text{log}(\lambda) n\overline{x} - n \lambda. \end{array}\]
Differentiate and set to 0 for the MLE:
\[\begin{array}{rcl} 0 &=& \frac{\delta}{\delta\lambda} \text{log} (L(\lambda\,; x_1, \ldots, x_n) ) \\ &=& \frac{\delta}{\delta\lambda} \left (K' + \text{log}(\lambda) n\overline{x} - n \lambda \right )\\ &=& \frac{n\overline{x}}{\lambda} - n \\ \end{array}\] assuming a unique maximum in \(0 < \lambda < \infty\) the MLE is \(\hat{\lambda} = \overline{x}\) as before.
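We can check \(\hat{\lambda} = \overline{x}\) numerically with simulated data:

```r
## numerical check that the Poisson MLE is the sample mean
set.seed(42)   ## just for reproducibility
x <- rpois(100, lambda = 4)

loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))
optimise(loglik, c(0.01, 20), maximum = TRUE)$maximum   ## ~ mean(x)
mean(x)
```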
Let \(X\) be a continuous random variable with probability density function \[f_X(x) = \left\{ \begin{array}{cl} \frac{2(s-x)}{s^2} & \mbox{for } 0 < x < s\,, \\ 0 & \mbox{otherwise}\,. \end{array} \right.\]
Here, \(s\) is a parameter to be estimated, where \(s\) is the maximum value of \(X\) and \(s>0\).
Assuming a single observation \(X=x\) the likelihood function is
\[ L(s\,;x) = \frac{2(s-x)}{s^2}\] for \(\;x < s < \infty.\)
Differentiating this
\[\begin{array}{rl} \frac{dL}{ds} &= 2 \left(-2 s^{-3}(s-x) + s^{-2}\right ) \\ &= 2s^{-3} (-2(s-x) + s) \\ &= \frac{2}{s^3} (2x-s). \end{array}\]
At the MLE, \[ \frac{dL}{ds} = 0 \implies s=\infty \quad\mbox{ or }\quad s = 2x.\]
Realistically \(s=\infty\) is not the maximum (see graph below) so \(s = 2x.\) Therefore the maximum likelihood estimator is \[\hat{s} = 2X.\]
Using R to get the MLE
Let’s say we observe \(X = 3\), then to find the MLE using R we use
## define the likelihood as a function of the parameter(s)
likelihood <- function(s) (2*(s - 3))/s^2
## Use the optimise function to optimise
## the second argument specifies the plausible
## interval for the parameter
## Note for a number of parameters > 1
## the optim() function is used
optimise(likelihood, c(1,8), maximum = TRUE)
## $maximum
## [1] 6.000018
##
## $objective
## [1] 0.1666667
How does this compare to the exact estimator, \(\hat{s} = 2X\), we found using calculus above? Consider, too, the plot below showing \(L(s; X = 3)\).
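The likelihood curve can also be drawn directly. A minimal sketch using base R graphics (the grid of \(s\) values is an arbitrary choice):

```r
## likelihood of s given the single observation X = 3
likelihood <- function(s) (2 * (s - 3)) / s^2
## evaluate on a grid of s values just above x = 3
s <- seq(3.01, 30, length.out = 500)
plot(s, likelihood(s), type = "l",
     xlab = "s", ylab = "L(s; X = 3)")
## mark the MLE, s-hat = 2x = 6
abline(v = 6, lty = 2)
```

The curve peaks at \(s = 6 = 2x\), matching both the calculus and the `optimise()` output above.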
“Critical thinking is an active and ongoing process. It requires that we all think like Bayesians, updating our knowledge as new information comes in.”
The probability of the event \(A\) occurring given that the event \(B\) has already occurred is \[P(A∣B) = \frac{P(A \:\text{and}\: B)}{P(B)}.\] This is called a conditional probability. Note that \(P(A∣B)\) is not, in general, the same as \(P(B∣A)\).
An example
Rapid antigen (lateral flow) tests are used to detect SARS-CoV-2 infection (COVID-19). They are very easy to use and a lot less uncomfortable than having a swab taken for a PCR test!
In summary, the lateral flow test can show a positive (\(+\)) or a negative (\(-\)) result. The person taking the test either has (infected) or does not have COVID-19 (not infected).
It is reported that the average sensitivity of the Innova lateral flow tests is \(\sim 0.787\). That is, given you have SARS-CoV-2 (COVID-19), the chance of a positive lateral flow test is 0.787.
It was also reported that the specificity of this test was 0.997. That is, the chance of a negative test given that you do not have COVID-19 is 0.997.
This can be summarised as \(P( + | \text{infected}) = 0.787\) and \(P( - | \text{not infected}) = 0.997\).
What you would probably like to know is given that the test is positive, what is the probability that you have COVID-19? \(P( \text{infected}| +) = ?\)
Let’s assume that the proportion of people in the population with COVID-19 is 10% (not far off the estimated percentage with Omicron in London a few weeks ago); that is, \(P(\text{infected}) = 0.1\).
But, what about \(P( \text{infected}| +) ?\)
Recall, \[P(A∣B) = \frac{P(A \:\text{and}\: B)}{P(B)}.\] So,
\[P( \text{infected}| +) = \frac{P(\text{infected} \:\text{and}\: +)}{P(+)}.\]
We have, \(P(\text{infected} \:\text{and}\: +) = P(\text{infected})\times P( + | \text{infected}) = 0.1 \times 0.787 = 0.0787.\)
So, \(P(+) = P(\text{infected} \:\text{and}\: +) + P(\text{not infected} \:\text{and}\: +) =\) \(0.0787 + ( 0.9 \times (1 - 0.997)) = 0.0787 + ( 0.9 \times 0.003) =\) \(0.0787 + 0.0027 = 0.0814.\)
Therefore, \(P( \text{infected}| +) = \frac{ 0.0787}{0.0814} = 0.9668305.\)
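A short R sketch reproducing the calculation above (the test characteristics and prevalence are as stated in the text):

```r
sens <- 0.787   ## P(+ | infected), sensitivity
spec <- 0.997   ## P(- | not infected), specificity
prev <- 0.1     ## P(infected), assumed prevalence
## P(+) = P(infected and +) + P(not infected and +)
p_pos <- prev * sens + (1 - prev) * (1 - spec)
## Bayes' theorem: P(infected | +)
prev * sens / p_pos   ## 0.0787 / 0.0814, approximately 0.967
```

Despite the test's high specificity, the answer depends strongly on the assumed prevalence; try re-running with a smaller `prev`.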
Rearranging,
\[P( \text{infected}| +) = \frac{P( + | \text{infected})P(\text{infected})}{P(+)}.\]
Bayes’ theorem, named after the British mathematician Reverend Thomas Bayes, is a mathematical formula for determining conditional probability. His work was presented in An Essay towards solving a Problem in the Doctrine of Chances, which was read to the Royal Society in 1763, after his death.
Theorem (Bayes’ rule): The conditional probability of the event \(B\) conditional on the event \(A\) is given by
\[P(B|A) = \frac{P(A|B)P(B)}{P(A)}\]
Let’s think of this in terms of our data and hypothesis:
\[P(\text{hypothesis}∣\text{data}) = \frac{P(\text{data} | \text{hypothesis} )P(\text{hypothesis})}{P(\text{data})}\]
Recall from the previous sections that our hypotheses relate to estimating parameter values (e.g., intercepts, differences in means, slopes, etc.). The formula above (Bayes’ theorem) calculates the probability of the parameter value(s), say \(\theta\), given the data.
But what is the difference here from maximum likelihood estimation (i.e., MLE, the frequentist approach)? Taking an MLE approach assumes that the parameters are fixed (i.e., they have one true value) but unknown, and are to be estimated; using this approach we typically obtain a point estimate of the parameter of interest. Taking a Bayesian approach assumes that the parameters are not fixed. Instead, parameters are assumed to come from some distribution (i.e., a range of plausible values). This approach assumes that we have some prior beliefs about the parameters (even if these beliefs are uninformative); this information is introduced a priori into the modelling framework.
\[P(B|A) = \frac{P(A|B)P(B)}{P(A)}\]
Let \(A = \text{data}\) and \(B = \theta\), then the above translates to
\[P(\theta∣\text{data}) = \frac{P(\text{data} | \theta )P(\theta)}{P(\text{data})}\]
where \(P(\theta∣\text{data})\) represents what you know after having seen the data. This is called the posterior distribution and is the basis for inference; it is a distribution, possibly multivariate if there is more than one parameter in \(\theta\). \(P(\text{data} | \theta )\) is the likelihood; think back to the previous section. \(P(\theta)\) is called the prior distribution and represents what you know before seeing the data; it is the source of much discussion about the Bayesian approach. Finally, the denominator
\[P(\text{data}) = \int P(\text{data}|\theta)P(\theta)d\theta\] is typically a high-dimensional integral, difficult if not impossible to calculate.
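With a single parameter, though, the integral is easy to evaluate numerically. A minimal sketch, assuming hypothetical binomial data (\(s\) successes in \(n\) trials) and a Beta(\(a\), \(b\)) prior:

```r
## hypothetical data: s successes in n trials, Beta(a, b) prior
s <- 7; n <- 20; a <- 1; b <- 1
## integrand: likelihood times prior, as a function of theta
integrand <- function(theta) dbinom(s, n, theta) * dbeta(theta, a, b)
## P(data): integrate the likelihood x prior over the parameter
p_data <- integrate(integrand, 0, 1)$value
p_data   ## with a uniform Beta(1, 1) prior this equals 1/(n + 1)
```

With the uniform Beta(1, 1) prior this integral has the known closed form \(1/(n+1)\), which the numerical result reproduces; with many parameters no such one-dimensional shortcut is available.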
A simple example: lobsters
Again we consider data from the published article Influence of predator identity on the strength of predator avoidance responses in lobsters.
Recall that the authors were interested in how a juvenile lobster’s size was related to its vulnerability to predation. In total, 159 juvenile lobsters were collected from their natural habitat in the Gulf of Maine, USA, and the length of each lobster’s carapace (upper shell) was measured to the nearest 3 mm. The lobsters were then tethered to the ocean floor for 24 hours. Any missing lobsters were assumed to have been consumed by a predator, while the surviving lobsters were released.
We define large juveniles as those with a carapace length \(\geq\) 40 mm; otherwise we class them as small.
library(tidyverse)
data <- read_csv("https://raw.githubusercontent.com/STATS-UOA/databunker/master/data/lobster.csv")
lobsters <- data %>%
mutate(size = ifelse(size >= 40, "large", "small")) %>%
mutate(survived = ifelse(survived == 0, "consumed", "alive")) %>%
group_by(size, survived) %>%
tally() %>%
pivot_wider(names_from = c(survived), values_from = n)
| size | alive | consumed |
|---|---|---|
| large | 56 | 23 |
| small | 23 | 57 |
So, as before, 23 of the small juvenile lobsters survived predation from a total of 80. We are interested in the probability of survival, \(\theta\), for the general population of small lobsters. The obvious estimate is simply to take the ratio, \(\frac{23}{80} = 0.2875\). But what is the statistical reasoning behind this common-sense estimate?
Let \(S\) be the number alive after the 24 hours, then we can assume a Binomial distribution:
\[P(S = s) = {n \choose s} \theta^s (1 - \theta)^{n-s}\]
A frequentist approach would be to maximise the likelihood with respect to \(\theta\) (see the previous section). This would result in the maximum likelihood estimate (MLE) of \(\hat{\theta} = \frac{23}{80} = 0.2875\).
Using a Bayesian approach we first need to start off with a prior distribution. This should reflect our prior beliefs about the parameter(s) of interest. We know \(\theta\) is a probability, so it is a continuous random variable that lies between zero and one.
A suitable prior distribution might be the Beta, defined on the interval [0, 1]. Therefore, we assume a priori \(\theta \sim \text{Beta}(a, b)\), so that \(P(\theta) \propto \theta^{a−1}(1 − \theta)^{b−1}\). See the plot below for the range of shapes a Beta distribution takes with different parameter values.
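A plot of this kind is easy to reproduce; a sketch using base R graphics, with a few illustrative (and entirely arbitrary) choices of \(a\) and \(b\):

```r
## a few Beta(a, b) shapes; the parameter pairs are illustrative choices
theta <- seq(0.01, 0.99, length.out = 200)
shapes <- list(c(1, 1), c(2, 2), c(5, 1), c(0.5, 0.5))
plot(NULL, xlim = c(0, 1), ylim = c(0, 4),
     xlab = expression(theta), ylab = "density")
for (i in seq_along(shapes))
  lines(theta, dbeta(theta, shapes[[i]][1], shapes[[i]][2]), lty = i)
legend("top", legend = c("Beta(1, 1)", "Beta(2, 2)",
                         "Beta(5, 1)", "Beta(0.5, 0.5)"), lty = 1:4)
```

Note how Beta(1, 1) is flat (uniform), Beta(2, 2) is symmetric around 0.5, Beta(5, 1) piles mass near 1, and Beta(0.5, 0.5) is U-shaped.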
Recall from above that
\[P(\theta∣\text{data}) = \frac{P(\text{data} | \theta )P(\theta)}{P(\text{data})},\] which can be written as
\[P(\theta∣\text{data}) \propto P(\text{data} | \theta )P(\theta).\] The \(\propto\) means proportional to; basically this means we can ignore any terms not containing the parameters. Therefore,
\[\begin{array}{rl} P(\theta∣s) & \propto {n \choose s} \theta^s (1 - \theta)^{n-s}\theta^{a−1}(1 − \theta)^{b-1} \\ & \propto \theta^{(a+s)−1}(1 − \theta)^{(b+n−s)−1} \end{array}.\]
So the posterior distribution for survival is \(\theta | s \sim \text{Beta}(a + s, b + n - s).\)
We’re going to choose an uninformative prior (i.e., \(\text{Beta}(1, 1)\) above). So \(\theta_{\text{small}} \sim \text{Beta}(1 + 23, 1 + 80 - 23) = \text{Beta}(24, 58).\) We want the expected value of this, which already has an explicit form:
\[\mathbb{E}[\text{Beta}(a, b)] = \frac{a}{a + b} = \frac{24}{82}.\]
How does this compare to our MLE estimate from above?
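As a quick numerical check, a sketch using the conjugate result above:

```r
a <- 1; b <- 1    ## uninformative Beta(1, 1) prior
s <- 23; n <- 80  ## small lobsters alive, out of 80
## posterior mean of Beta(a + s, b + n - s)
post_mean <- (a + s) / (a + b + n)
post_mean  ## 24/82, approximately 0.293
s / n      ## the MLE, 0.2875
```

The posterior mean is very close to, but slightly larger than, the MLE: the Beta(1, 1) prior acts like one extra survivor and one extra consumed lobster added to the data.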
Bayesian and frequentist estimates will typically agree if there is sufficient data, so long as the values supported by the likelihood are not explicitly ruled out by the prior.
When choosing a prior distribution you should focus on what that prior means in the context of the research problem. Prior choice will influence the posterior distribution. Uninformative priors can be chosen if we wish to rely only on the likelihood (i.e., let the data speak for itself), which is itself subjective based on how/where data were collected. However, uninformative priors are in general unrealistic as equal weight is given to all values!
Prior choice and sensitivity is beyond the scope of this course; however, I would strongly encourage you to read the linked materials at the start of this module.
What would happen to our posterior distribution, above, if we were to choose a different prior?
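One way to explore this is to recompute the posterior mean under a few alternative priors; a sketch, with hypothetical Beta parameter choices:

```r
s <- 23; n <- 80
## three illustrative priors: uniform, weakly informative,
## and one strongly favouring high survival (hypothetical choices)
priors <- list(c(1, 1), c(2, 2), c(20, 5))
post_means <- sapply(priors, function(p) {
  ## posterior mean of Beta(p[1] + s, p[2] + n - s)
  (p[1] + s) / (p[1] + p[2] + n)
})
post_means
```

The uniform and weakly informative priors barely move the estimate, while the strong prior favouring high survival pulls the posterior mean well above the MLE; as more data accumulate, the influence of any fixed prior shrinks.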
some might say↩︎
Ross Ihaka has been known to express his preference for = simply due to requiring less typing↩︎
although since v.4.2.0 the base pipe does now allow for a named placeholder _↩︎
Zhao Cao, Weiran Jin, and Jack Blackwood from the 2023 cohort↩︎
https://researchscienceinnovation.nz/case-studies/relentless-science-communication-in-the-time-of-covid/index.html↩︎